Testing llama.cpp MTP support on Qwen3.6 - RTX 5090
Mirrored from r/LocalLLaMA.
Setup:

- RTX 5090 (32 GB VRAM), Linux
- llama.cpp built from commit 4f13cb7; the official ghcr.io/ggml-org/llama.cpp:server-cuda image hadn't picked up the merge yet as of writing, so I had to docker build from source with CUDA_DOCKER_ARCH=120 (build and launch commands sketched after this list)
- Unsloth's Qwen3.6-27B-MTP-GGUF Q5_K_M and Qwen3.6-35B-A3B-MTP-GGUF UD-Q4_K_M
- 128k context, flash-attn, q8_0 KV cache, temp 0.8, --parallel 1 (required for MTP)
- Same GGUF for the "MTP on" and "MTP off" runs; only the --spec-type draft-mtp --spec-draft-n-max 3 flags were toggled, which isolates the effect of MTP from quantization differences
- Two prompts: "short story about a cat" (~400 tokens) and "Flappy Bird clone as a single HTML file" (~3000 tokens)
- 3 seeds per config, averaged
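For anyone trying to reproduce this, here is roughly what the build and launch look like. This is a sketch, not the exact commands from the post: it assumes llama.cpp's standard .devops/cuda.Dockerfile with its documented server target and CUDA_DOCKER_ARCH build arg, and it maps the post's settings (128k context, flash-attn, q8_0 KV cache, temp 0.8) onto the usual llama-server flag names. The model filename, mount path, and port are placeholders; only CUDA_DOCKER_ARCH=120, --parallel 1, and the two --spec-* flags are taken verbatim from the post.

```sh
# Build the CUDA server image from source at the commit named in the post.
# CUDA_DOCKER_ARCH=120 targets the RTX 5090 (Blackwell, compute capability 12.0).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && git checkout 4f13cb7
docker build -t local/llama.cpp:server-cuda \
  --build-arg CUDA_DOCKER_ARCH=120 \
  --target server -f .devops/cuda.Dockerfile .

# Launch with the post's settings. Model filename and host path are
# placeholders; drop the two --spec-* flags for the "MTP off" runs.
docker run --rm --gpus all -p 8080:8080 \
  -v /path/to/models:/models \
  local/llama.cpp:server-cuda \
  -m /models/Qwen3.6-27B-MTP-Q5_K_M.gguf \
  --ctx-size 131072 --flash-attn \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --temp 0.8 --parallel 1 \
  --spec-type draft-mtp --spec-draft-n-max 3
```

Since the same GGUF serves both configurations, toggling only the --spec-* flags between runs keeps everything else (quantization, KV cache type, sampling) constant, which is what makes the MTP on/off comparison meaningful.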