MTP has no impact on my Qwen3.6 MoE performance
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hello, I have an RTX 5060 Ti and I tried running unsloth's Qwen3.6-35B GGUF with and without MTP. However, in both cases I get around 60 tok/s.
Here are my flags:
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --alias unsloth/Qwen3.6 --port 8002 --kv-unified --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --fit on --no-mmproj --ctx-size 64000
For the MTP variant I of course add the following, as per the unsloth guide:
--spec-type draft-mtp --spec-draft-n-max 2 --presence-penalty 1.5
I tried reducing the ctx size, removing cache quantization, and adding `--no-mmap`. The speed changes slightly with each of these, but it stays the same between the MTP and non-MTP runs. I thought MTP was supposed to offer a speedup.
Does anybody have an idea why?
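For anyone who wants to reproduce the comparison, here is a minimal Python sketch of how the two runs could be measured. It assumes the server's `/completion` endpoint returns a `timings` object with `predicted_n` and `predicted_ms`. The `draft_n` and `draft_n_accepted` fields are an assumption about what the build reports when drafting is active, so the code treats them as optional. The helper names are made up for illustration.

```python
import json
import urllib.request


def tokens_per_second(timings: dict) -> float:
    """Generation speed: predicted tokens divided by generation time in seconds."""
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


def acceptance_rate(timings: dict):
    """Fraction of drafted tokens that were accepted.

    Returns None when no draft counters are present (assumed field names:
    draft_n, draft_n_accepted), e.g. on the non-MTP run.
    """
    drafted = timings.get("draft_n", 0)
    if not drafted:
        return None
    return timings.get("draft_n_accepted", 0) / drafted


def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Ratio of MTP speed to baseline speed; 1.0 means no gain."""
    return mtp_tps / baseline_tps


def fetch_timings(base_url: str, prompt: str, n_predict: int = 256) -> dict:
    """Send one non-streaming completion request and return its timings dict."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    request = urllib.request.Request(
        base_url + "/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response).get("timings", {})
```

Calling `fetch_timings("http://localhost:8002", prompt)` once against each variant with the same prompt, then passing the two `tokens_per_second` results to `speedup`, gives a like-for-like number. If `acceptance_rate` comes back as `None` on the MTP run, that would suggest no drafts are being counted at all, though it could also just mean the build does not expose those fields.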