Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B mtp for Hermes Agent on a single RTX 3090
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
UPDATED (POST b9200)
Okay, here is the updated version using the new Qwen 3.6 27B mtp gguf from Unsloth, running it as the backend for the hermes agent. While dialing it in, I noticed that the currently recommended Unsloth mtp flags actually bottleneck performance and tank draft acceptance rates for strict, multi-turn agentic workflows. Pairing a custom config with today's brand new llama.cpp b9200 release — which specifically fixes mtp memory traffic overhead — completely turns that around.
Hardware/Software
* RTX 3090 (24GB VRAM) — currently undervolted to keep temps down
* Ryzen 7 5700G / 64GB
* Qwen3.6-27B-IQ4_NL.gguf
* llama-server (b9200+ compiled from source, commit #23234)
* hermes agent (64K context) max to limit spillover
The problem with default mtp settings
Running the standard recommended mtp flags (--spec-draft-n-max 6 and --spec-draft-p-min 0.75) gave poor results for agentic loops. Generation speeds sat around 7–8 t/s, and the mtp draft acceptance rate hovered around 22–26%.
Agent workflows are rigid. A 6-token lookahead frequently guesses the wrong punctuation, the main model rejects the draft, and the GPU throws out the math and recalculates — completely negating the mtp speed boost. Without explicitly declaring parallel slots, llama-server also defaults to 4, eating up memory bandwidth managing unused context slots.
The fix and the b9200 boost
For agent workflows on a 24GB card, limit to a single slot, drop the lookahead to 3, and remove the p-min threshold so it doesn't hesitate on rigid syntax. Combined with the b9200 release — which stops copying the full logits for every token in the batch during prompt processing — the optimized launch command looks like this:
.\build\bin\Release\llama-server.exe ^
-m D:\models\Qwen3.6-27B-IQ4_NL.gguf ^
--spec-type draft-mtp ^
--spec-draft-n-max 3 ^
--ctx-size 65536 ^
--parallel 1 ^
--flash-attn on ^
--cache-type-k q8_0 ^
--cache-type-v q8_0 ^
--port 8081
Results (Prior to the update vs. Post-b9200)
Prior to the update (but with the optimized flags):
* Prompt processing sat around ~560 t/s. ***FIXED NUMBERS...***
* Token generation hit 17.06 t/s on short tasks and ~9.5 t/s during heavy context reasoning loops.
* Draft acceptance rate climbed to 77% (proving a shorter lookahead works better for strict formatting).
After the b9200 update:
* Prompt processing stabilized around ~611 t/s. **Updated Numbers**
The real magic of the memory traffic fix paired with --parallel 1 is that it unclogs the VRAM bus so the text generation phase can actually breathe.
* Token generation hit a peak of 27.44 t/s on short tasks and stabilized at a highly usable 13.69 t/s during heavy context loops where the agent is actively switching between tool calls and main memory.
* Draft acceptance rate maintained a solid ~70% on standard turns.
When your VRAM bus isn't clogged by ghost parallel slots or 6-token lookahead rejections, an undervolted 3090 can still push nearly 30 t/s on a dense 27B model!
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.