Worse quality with MTP - Qwen 3.6, Gemma 4
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hi.
I am self-hosting Qwen 3.6 27B Q8_K_XL with Llama.cpp on 4x5070ti.
(All 4 cards are on single x16 slot bifurcated to 4x4 with risers).
I've been testing it on several work repos with Opencode CLI and in like 8/10 situations the output of non-MTP model is far superior to the MTP ones.
The prompt is simple `Do a code review of this branch.`.
The non MTP produces more findings, with more detailed descriptions, with fix suggestion snippets, everything is better. Usually takes fewer tokens also (for example like ~40k for non MTP vs ~60k for MTP).
And real life speed is not so great either:
- The non-MTP for me is like ~2000 pp/s and ~50-60 tg/s.
- The MTP is like ~1300 pp/s and ~100-120 tg/s.
So while MTP has double TG numbers, the real life agent tasks are like within 20% of time taken when comparing MTP vs Non MTP.
I do not understand what I am doing wrong - everyone swears that MTP is like free performance with same quality, but for me the MTP degrades output, needs more VRAM (that I expected before ofc), consumes more context...
My settings
__Qwen MTP__ (file from https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF)
```bash
exec /opt/llama.cpp/build-cuda/bin/llama-server \
--host 0.0.0.0 \
--port 8081 \
--alias Qwen3.6-27B \
--model /opt/models/qwen36/27b/unsloth/Qwen3.6-27B-UD-Q8_K_XL.gguf \
--ctx-size 262144 \
--device CUDA0,CUDA1,CUDA2,CUDA3 \
--fit off \
--split-mode tensor \
--tensor-split 1,1,1,1 \
--gpu-layers all \
--flash-attn on \
--kv-offload \
--cache-type-k f16 \
--cache-type-v f16 \
--batch-size 4096 \
--ubatch-size 1024 \
--parallel 1 \
--jinja \
--top-p 0.95 \
--top-k 20 \
--temp 0.6 \
--min-p 0.00 \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--no-cache-idle-slots \
--cache-ram 32768 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--mmproj /opt/models/qwen36/27b/unsloth/mmproj-BF16.gguf \
--image-min-tokens 1024 \
--cache-prompt \
--ctx-checkpoints 128 \
--checkpoint-min-step 512 \
--cache-reuse 512 \
--cache-idle-slots \
--no-context-shift \
--no-kv-unified \
--slot-prompt-similarity 0.10 \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--no-mmproj-offload
```
For __Qwen Non MTP__ (file from https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
the only thing that differs is:
```bash
--model /opt/models/qwen36/27b/unsloth/Qwen3.6-27B-UD-NoMTP-Q8_K_XL.gguf
# missing --spec-type and --spec-draft-n-max flags
```
Also tried https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF with the similar experience comparing MTP and non-MTP.
Anyone had the similar experience?
P.S. I'll add some examples on some OSS repos perhaps with llama.cpp logs, when I got home.
[link] [comments]
More from r/LocalLLaMA
-
Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought
Jun 30
-
I Hate Dario Amodei, and everything he stands for.
Jun 29
-
Introducing LongCat-2.0 - , a large-scale MoE language model with 1.6 trillion total parameters and ~48 billion activated per token. This was the stealth model that was on Openrouter under the name 'owl-alpha'.
Jun 29
-
Krea-2-Turbo Image Model - Easy to be fully uncensored, but it can also EDIT Images!
Jun 29
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.