Llama.cpp MTP with Qwen3.6 27B on Headless RTX 3090

I'd seen posts about prompt processing (PP) being slower with MTP, which made people cautious about trying it.

Here's a real-world datapoint.

Settings:

  • Headless RTX 3090 24G
  • OpenCode
  • Model: unsloth's Qwen3.6-27B-MTP-Q4_K_M.gguf
  • 128k context
  • q8_0 kv cache
  • --spec-draft-n-max: 3
  • --draft-p-min: 0
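
Those settings map to roughly the launch command below. Treat it as a sketch, not the exact invocation: the model path, -ngl, host, and port are placeholders I've filled in, and the two draft flags are the MTP knobs from the list above (they come from the MTP-capable build, not necessarily stock master).

    # Illustrative sketch -- path, -ngl, host, and port are placeholders.
    # -c 131072 = 128k context; a q8_0 KV cache generally requires
    # flash attention to be enabled.
    ./llama-server \
      -m ./Qwen3.6-27B-MTP-Q4_K_M.gguf \
      -c 131072 \
      -ngl 99 \
      -fa on \
      --cache-type-k q8_0 --cache-type-v q8_0 \
      --spec-draft-n-max 3 \
      --draft-p-min 0 \
      --host 0.0.0.0 --port 8080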

Use Cases:

  • Research task that uses ~85,000 tokens
  • Coding task that uses ~85,000 tokens

Without MTP (llama.cpp:server-cuda13-b9174):

  • PP: 1,050 tok/s
  • TG: 27 tok/s
  • Total time to complete 85k tokens: ~39 mins

With MTP (latest master fork):

  • PP: 600 tok/s (down ~43%)
  • TG: 50 tok/s (up 85%)
  • Total time to complete 85k tokens: ~23 mins (1.7x faster or 41% reduction)
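
Quick sanity check on those totals: 39 / 23 ≈ 1.7x, and (39 − 23) / 39 ≈ 41% less wall-clock time, so the headline claim is internally consistent despite the PP hit.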

A 41% time savings is huge, so unless your workload is PP-heavy, I'd recommend giving MTP a try on your own use cases! One caveat: I run a dual-agent setup where a second critic agent checks the main agent's work, so a single-agent workflow may see even better total processing times.

submitted by /u/cleversmoke