r/LocalLLaMA · June 25, 2026

Worse quality with MTP - Qwen 3.6, Gemma 4


Hi.
I am self-hosting Qwen 3.6 27B Q8_K_XL with llama.cpp on 4x 5070 Ti.
(All 4 cards are on a single x16 slot bifurcated to 4x4 with risers.)

I've been testing it on several work repos with Opencode CLI, and in roughly 8 out of 10 cases the output of the non-MTP model is far superior to the MTP one.

The prompt is simple: `Do a code review of this branch.`
The non-MTP model produces more findings, with more detailed descriptions and fix-suggestion snippets; everything is better. It usually also uses fewer tokens (for example, ~40k for non-MTP vs ~60k for MTP).

And real-life speed is not so great either:
- Non-MTP gets ~2000 t/s prompt processing (pp) and ~50-60 t/s token generation (tg) for me.
- MTP gets ~1300 t/s pp and ~100-120 t/s tg.

So while MTP doubles the TG numbers, real-life agent tasks finish within about 20% of each other in total time when comparing MTP vs non-MTP.
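
One way to see why doubling TG does not halve task time is a rough estimate: wall-clock time is approximately prompt tokens divided by the PP rate, plus generated tokens divided by the TG rate. The sketch below uses the midpoints of the rates quoted above. The token counts (100k prompt tokens processed, 10k generated, and 15k for a more verbose MTP run) and the helper name `estimate_seconds` are hypothetical and chosen only for illustration. The estimate also ignores prompt caching and per-request overhead.

```bash
#!/usr/bin/env bash
# Rough estimate: seconds = prompt_tokens / pp_rate + generated_tokens / tg_rate.
# Hypothetical helper; ignores prompt caching, tool-call latency, and overhead.
estimate_seconds() {
  local prompt_tokens=$1 pp_rate=$2 gen_tokens=$3 tg_rate=$4
  awk -v p="$prompt_tokens" -v pp="$pp_rate" -v g="$gen_tokens" -v tg="$tg_rate" \
    'BEGIN { printf "%.0f\n", p / pp + g / tg }'
}

# Token counts are illustrative assumptions, not measurements.
echo "non-MTP:          $(estimate_seconds 100000 2000 10000 55) s"
echo "MTP, same output: $(estimate_seconds 100000 1300 10000 110) s"
echo "MTP, 1.5x output: $(estimate_seconds 100000 1300 15000 110) s"
```

With these assumed counts the estimates come out to about 232 s, 168 s, and 213 s. The slower prompt processing eats part of the TG gain, and a more verbose MTP run eats most of the rest.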

I do not understand what I am doing wrong. Everyone swears that MTP is basically free performance with the same quality, but for me MTP degrades output, needs more VRAM (which I expected, ofc), and consumes more context...

My settings

__Qwen MTP__ (file from https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF)

```bash
exec /opt/llama.cpp/build-cuda/bin/llama-server \
--host 0.0.0.0 \
--port 8081 \
--alias Qwen3.6-27B \
--model /opt/models/qwen36/27b/unsloth/Qwen3.6-27B-UD-Q8_K_XL.gguf \
--ctx-size 262144 \
--device CUDA0,CUDA1,CUDA2,CUDA3 \
--fit off \
--split-mode tensor \
--tensor-split 1,1,1,1 \
--gpu-layers all \
--flash-attn on \
--kv-offload \
--cache-type-k f16 \
--cache-type-v f16 \
--batch-size 4096 \
--ubatch-size 1024 \
--parallel 1 \
--jinja \
--top-p 0.95 \
--top-k 20 \
--temp 0.6 \
--min-p 0.00 \
--spec-type draft-mtp \
--spec-draft-n-max 2 \
--no-cache-idle-slots \
--cache-ram 32768 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--mmproj /opt/models/qwen36/27b/unsloth/mmproj-BF16.gguf \
--image-min-tokens 1024 \
--cache-prompt \
--ctx-checkpoints 128 \
--checkpoint-min-step 512 \
--cache-reuse 512 \
--cache-idle-slots \
--no-context-shift \
--no-kv-unified \
--slot-prompt-similarity 0.10 \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking":true}' \
--no-mmproj-offload
```

For __Qwen Non MTP__ (file from https://huggingface.co/unsloth/Qwen3.6-27B-GGUF)
the only thing that differs is:
```bash
--model /opt/models/qwen36/27b/unsloth/Qwen3.6-27B-UD-NoMTP-Q8_K_XL.gguf
# --spec-type and --spec-draft-n-max are not set
```

I also tried https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF, with a similar experience comparing MTP and non-MTP.

Has anyone had a similar experience?

P.S. I'll add some examples on some OSS repos, perhaps with llama.cpp logs, when I get home.

submitted by /u/Significant_Bar_460