llama.cpp - Qwen3.6/3.5-MTP - Share your benchmarks t/s
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I think the dust has mostly settled (95%+) for Qwen3.6/3.5-MTP. Since the initial PR there have been many optimizations and fixes. Even earlier today, another MTP-related PR was merged and released (b9495). So try the latest version and share your benchmarks (t/s)*. Great work by u/am17an and the other folks.
* Please share all the details so the results are useful to others too. Benchmarks that are missing key details are hard to interpret or compare. I (we) would also like to find the most optimized full command for the best t/s.
To save time, just copy your console output along with the full command (which has all the important details: model quant, context size, KV cache, fit/ncmoe, MTP, etc.) and paste it here. A sample is below (not mine, pasted from a random thread).
    llama-server \
      -m ../models/Qwen3.6-35B-A3B-MTP-UD-Q5_K_XL.gguf \
      --host 0.0.0.0 \
      --port 8080 \
      --ctx-size 150000 \
      --flash-attn on \
      -b 2048 \
      -ub 512 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --jinja \
      --threads 11 \
      --threads-batch 11 \
      -cram 12288 \
      --mlock \
      -fit on \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --spec-type mtp \
      --spec-draft-n-max 3 \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.0 \
      -np 1 \
      --presence-penalty 0.0 \
      --repeat-penalty 1.0

    prompt eval time = 128889.09 ms / 26796 tokens (4.81 ms per token, 207.90 tokens per second)
    eval time = 10969.17 ms / 264 tokens (41.55 ms per token, 24.07 tokens per second)
    total time = 139858.26 ms / 27060 tokens
    draft acceptance rate = 0.52614 ( 161 accepted / 306 generated)
    statistics mtp: #calls(b,g,a) = 6 2811 2305, #gen drafts = 2811, #acc drafts = 2305, #gen tokens = 8433, #acc tokens = 5507, dur(b,g,a) = 0.020, 41478.073, 74.975 ms
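If you want to compare several runs, the timing and acceptance lines in the sample above can be pulled out with a few lines of Python. This is only a rough sketch: the regular expressions assume the exact wording of the sample output shown here (other llama.cpp builds may print it differently), and the names `summarize` and `SAMPLE_LOG` are made up for illustration.

```python
import re

# Patterns matched against the wording of the sample output in this post.
# Assumption: other builds may format these lines differently.
TIMING_RE = re.compile(
    r"(prompt eval|eval|total) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens"
)
ACCEPT_RE = re.compile(
    r"draft acceptance rate\s*=\s*[\d.]+\s*"
    r"\(\s*(\d+) accepted\s*/\s*(\d+) generated\)"
)

# Timing and acceptance lines copied from the sample in this post.
SAMPLE_LOG = """
prompt eval time = 128889.09 ms / 26796 tokens (4.81 ms per token, 207.90 tokens per second)
eval time = 10969.17 ms / 264 tokens (41.55 ms per token, 24.07 tokens per second)
total time = 139858.26 ms / 27060 tokens
draft acceptance rate = 0.52614 ( 161 accepted / 306 generated)
"""


def summarize(log: str) -> dict:
    """Recompute tokens/second and draft acceptance from pasted console output."""
    stats = {}
    for name, ms, tokens in TIMING_RE.findall(log):
        seconds = float(ms) / 1000.0
        stats[name] = {
            "ms": float(ms),
            "tokens": int(tokens),
            # tokens per second = tokens / elapsed seconds
            "tps": int(tokens) / seconds if seconds > 0 else 0.0,
        }
    match = ACCEPT_RE.search(log)
    if match:
        accepted, generated = int(match.group(1)), int(match.group(2))
        # acceptance rate = accepted draft tokens / generated draft tokens
        stats["acceptance"] = accepted / generated if generated else 0.0
    return stats


if __name__ == "__main__":
    result = summarize(SAMPLE_LOG)
    print(f"prompt processing: {result['prompt eval']['tps']:.2f} t/s")
    print(f"generation:        {result['eval']['tps']:.2f} t/s")
    print(f"draft acceptance:  {result['acceptance']:.5f}")
```

Paste your own console output in place of `SAMPLE_LOG` to get the same three numbers for your run. Recomputing t/s from the raw milliseconds and token counts also makes it easy to spot a copy/paste mistake.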