Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090
Mirrored from r/LocalLLaMA for archival readability.
I'd seen posts noting that MTP slows down prompt processing (PP), which made some people cautious about trying it. Here's a real-world datapoint.
Settings:
- Headless RTX 3090 (24 GB)
- OpenCode
- Model: unsloth's Qwen3.6-27B-MTP-Q4_K_M.gguf
- 128k context
- q8_0 kv cache
- --spec-draft-n-max: 3
- --draft-p-min: 0
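For reference, a launch command assembling the settings above might look like the following. This is a sketch, not the author's exact invocation: the model path, port, and `--n-gpu-layers` value are placeholders I've filled in, and flag spellings should be verified against your llama.cpp build.

```shell
# Hypothetical llama-server launch matching the listed settings.
# Model path, port, and GPU-layer count are assumptions, not from the post.
./llama-server \
  --model Qwen3.6-27B-MTP-Q4_K_M.gguf \
  --ctx-size 131072 \              # 128k context
  --cache-type-k q8_0 \            # q8_0 KV cache (keys)
  --cache-type-v q8_0 \            # q8_0 KV cache (values)
  --spec-draft-n-max 3 \
  --draft-p-min 0 \
  --n-gpu-layers 99 \              # assumed full offload on the 24 GB card
  --port 8080
```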
Use Cases:
- Research task that uses ~85,000 tokens
- Coding task that uses ~85,000 tokens
Without MTP (llama.cpp:server-cuda13-b9174):
- PP: 1,050 tok/s
- TG: 27 toks/s
- Total time to complete 85k tokens: ~39 mins
With MTP (latest master fork):
- PP: 600 tok/s (down 42%)
- TG: 50 tok/s (up 85%)
- Total time to complete 85k tokens: ~23 mins (1.7x faster or 41% reduction)
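The reported percentages can be sanity-checked from the raw numbers alone. A quick check (all inputs are figures from the post; note the PP drop is closer to 43% before rounding):

```python
# Sanity-check the reported MTP speedup figures (inputs taken from the post).
pp_base, pp_mtp = 1050, 600   # prompt-processing tok/s, without/with MTP
tg_base, tg_mtp = 27, 50      # token-generation tok/s, without/with MTP
t_base, t_mtp = 39, 23        # minutes to complete the ~85k-token task

pp_change = (pp_mtp - pp_base) / pp_base   # about -0.43 (post rounds to 42%)
tg_change = (tg_mtp - tg_base) / tg_base   # about +0.85, matching "up 85%"
speedup = t_base / t_mtp                   # about 1.7x end-to-end
reduction = (t_base - t_mtp) / t_base      # about 0.41, i.e. 41% less wall time

print(f"PP change: {pp_change:+.0%}, TG change: {tg_change:+.0%}")
print(f"End-to-end: {speedup:.1f}x faster ({reduction:.0%} time saved)")
```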
A 41% time savings is substantial, so unless your workload is PP-heavy, I'd recommend giving MTP a try on your own use cases. Note that I run a dual-agent setup where a critic agent checks the main agent's work, so your total processing times may be even better than mine.