r/LocalLLaMA · · 1 min read

MTP has no impact on my Qwen3.6 MoE performance

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hello I have an rtx 5060Ti and I tried running unsloth's Qwen3.6-35B GGUF with MTP. However in both cases I have around 60 tok/s.

Here are my flags:

llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --alias unsloth/Qwen3.6 --port 8002 --kv-unified --cache-type-k q8_0 --cache-type-v q8_0 --flash-attn on --fit on --no-mmproj --ctx-size 64000 

For the MTP variant of course I add the following as per the unsloth guide.

--spec-type draft-mtp --spec-draft-n-max 2 --presence-penalty 1.5

I tried to reduce the ctx size, remove cache quantization, add `--no-mmap` and although the speed changes slightly, it remains the same between MTP/non MTP. I thought it was supposed to offer a speedup.

Anybody has an idea why?

submitted by /u/redblood252
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA