Qwen 3.6 27B MTP - Adding spec-type and spec-draft-n-max is dropping tps and reducing GPU utilization

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I have a 5090 power limited to 475W. When I run the following command, it barely hits 300W and I get something like 30 t/s:

    ./llama-server \
      -m ~/myp/models/unsloth_mtp_Qwen3.6-27B-UD-Q5_K_XL.gguf \
      --host 0.0.0.0 \
      --port 8080 \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.0 \
      --presence-penalty 0.0 \
      -fit on \
      -c 131072 \
      -fitt 3000 \
      --spec-type draft-mtp \
      --spec-draft-n-max 2 \
      -n -1 \
      -fa on \
      --repeat-penalty 1.0

But if I remove these two params, it shoots up to 475W and I get 70 t/s:

    --spec-type draft-mtp \
    --spec-draft-n-max 2 \

I tried setting spec-draft-n-max to 1, 2, and 4 and got the same results each time. I'm also getting a decent acceptance rate (> 50%).
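As a rough sanity check on those numbers, here is a minimal sketch of the usual back-of-the-envelope model for speculative decoding. It assumes each drafted token is accepted independently with probability p, which is a simplification and not necessarily how llama.cpp computes its reported acceptance rate. The function names are hypothetical, and the inputs are just the figures from this post (70 t/s without speculation, 30 t/s with it, p = 0.5).

```python
def expected_tokens_per_step(p: float, n: int) -> float:
    """Expected tokens emitted per verification pass with n drafted tokens.

    Assumes (hypothetically) that each drafted token is accepted
    independently with probability p. The target model always yields one
    token, plus one more for every drafted token in the accepted prefix.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # 1 + p + p^2 + ... + p^n
    return sum(p ** k for k in range(n + 1))


def implied_step_cost(baseline_tps: float, observed_tps: float,
                      p: float, n: int) -> float:
    """How many times slower than a plain decode step each speculative
    step would have to be to explain the observed throughput."""
    return expected_tokens_per_step(p, n) * baseline_tps / observed_tps


if __name__ == "__main__":
    # Figures taken from the post: 70 t/s without MTP, ~30 t/s with it.
    for n in (1, 2, 4):
        tokens = expected_tokens_per_step(0.5, n)
        cost = implied_step_cost(70.0, 30.0, 0.5, n)
        print(f"n={n}: {tokens:.2f} tokens/step, "
              f"implied step cost {cost:.2f}x baseline")
```

Under this model, a 50% acceptance rate with 2 drafted tokens should give about 1.75 tokens per verification pass, so dropping from 70 t/s to 30 t/s would require each speculative step to take roughly four times as long as a normal decode step. If the model is even roughly right, that points at per-step overhead rather than the acceptance rate.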

My test prompt is: "1000 words like Roald Dahl".

What is going on? I swear this was giving me 100+ t/s until 2 days ago. I might have synced llama.cpp to HEAD and recompiled since then, but I'm not entirely sure.

submitted by /u/BitGreen1270