Qwen 3.6 27B Speculative Decoding Bench: Pushing ~100 TPS on a single RTX 3090
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
First of all, a huge thank you to the r/LocalLLaMA community and the 3090 club. This benchmark started from your shared recipes...
These are my findings on my hardware (Xeon E5-2666v3, 64GB RAM, single RTX 3090 24GB) comparing 5 engines (3 llama.cpp forks + mainline + Lucebox) across two quantizations of the same model.
I used the bench script from https://github.com/noonghunna/club-3090/tree/master and two simple scripts that build long prompts from enwik8.
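The prompt-building scripts are not shown in the post. A minimal sketch of the idea, assuming a plain-text wiki dump on disk and a rough 4-characters-per-token heuristic (both the heuristic and the function names are hypothetical, not taken from the original scripts):

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic (assumption), not a tokenizer measurement


def load_corpus(path: str) -> str:
    # The dump contains raw wiki markup; bad bytes are replaced instead of raising.
    return Path(path).read_text(encoding="utf-8", errors="replace")


def build_long_prompt(corpus_text: str, target_tokens: int, question: str) -> str:
    """Cut roughly target_tokens worth of corpus text and append a question."""
    budget_chars = target_tokens * CHARS_PER_TOKEN
    chunk = corpus_text[:budget_chars]
    # If the corpus was actually truncated, trim back to the last space
    # so the cut does not land in the middle of a word.
    cut = chunk.rfind(" ")
    if cut > 0 and len(corpus_text) > budget_chars:
        chunk = chunk[:cut]
    return f"{chunk}\n\n{question}"
```

For an exact fill level such as 72k or 128k tokens, the character budget would need to be checked against the model's real tokenizer, since the ratio varies with the text.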
Summary Table
Sorted by fork → speculative type. Key metrics: decode_TPS (code & narrative), TTFT, VRAM usage, and context consistency (generation speed degradation when moving from 72k to 128k filled context).
| Fork / Engine | Speculative Type | Model / Quant | Code TPS | Narr. TPS | TTFT | VRAM (MiB) | Gen 72k | Gen 128k | Deg. (72k→128k) |
|---|---|---|---|---|---|---|---|---|---|
| ik_llama (ubergarm config) | MTP n_max=4 | Qwen3.6-27B-IQ4_KS | 89.2 | 63.9 | 361ms | 22304 | 34.6 | 23.5 | −32.1% |
| ik_llama + ngram | ngram+MTP | Qwen3.6-27B-IQ4_KS | 87.8 | 58.6 | 341ms | 20508 | 32.1 | 24.1 | −24.9% |
| ik_llama (Standard config) | MTP n_max=2 | Qwen3.6-27B-IQ4_KS | 73.1 | 61.7 | 357ms | 20208 | 33.8 | 25.4 | −24.8% |
| mainline llama.cpp | MTP n_max=1 | Qwen3.6-27B-Q4_K_M | 64.7 | 52.5 | 288ms | 21354 | 33.4 | 31.2 | −6.6% |
| Spiritbuun | MTP | Qwen3.6-27B-Q4_K_M | 59.7 | 45.7 | 294ms | 22066 | 34.8 | 31.5 | −9.5% |
| beellama | DFlash (Draft GGUF) | Qwen3.6-27B-Q4_K_M | 96.8 | 45.6 | 504ms | 20814 | 22.9* | 27.1 | −41.3%** |
| Spiritbuun | DFlash | Qwen3.6-27B-Q4_K_M | 66.9 | 30.4 | 300ms | 23356 | — | — | — |
| LUCEBOX | DFlash (TQ3 KV) | Qwen3.6-27B-Q4_K_M | 32.6 | 32.5 | 448ms | 20680 | 27.0 | — | — |
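The degradation column appears to be the relative change in generation speed between the two context fills. A small sketch that recomputes it from the table values (the beellama row is left out because its footnote markers are not explained in the post):

```python
def degradation_pct(gen_72k: float, gen_128k: float) -> float:
    """Relative change in generation speed from 72k to 128k context, in percent."""
    return (gen_128k - gen_72k) / gen_72k * 100.0


# (Gen 72k, Gen 128k) pairs copied from the summary table.
rows = {
    "ik_llama (ubergarm config)": (34.6, 23.5),
    "ik_llama + ngram": (32.1, 24.1),
    "ik_llama (Standard config)": (33.8, 25.4),
    "mainline llama.cpp": (33.4, 31.2),
    "Spiritbuun MTP": (34.8, 31.5),
}

for name, (g72, g128) in rows.items():
    print(f"{name}: {degradation_pct(g72, g128):+.1f}%")
```

Small differences from the table in the last digit can come from rounding versus truncation.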
ik_llama — The fork that does "everything"
Fork of llama.cpp with native MTP support, merge-qkv, recurrent checkpoints, and multi-backend speculative decoding. Tested on IQ4_KS quant (by ubergarm).
ik_llama + MTP+ngram (ngram-mod + mtp)
Great code generation. Combines ngram drafts (n_max=4, size 16) with MTP (n_max=3). Code hits 87.8 decode tokens/sec — a massive jump over mainline.
- VRAM: 20508 MiB (82% GPU utilization)
- Context degradation: −25% (32.1→24.1 gen_tps). Notable drop when context fills.
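To illustrate why ngram drafting tends to help code more than narrative, here is a minimal sketch of a lookup-style drafter: it matches the most recent tokens against earlier context and proposes whatever followed last time. This is not ik_llama's implementation; the function name is hypothetical, and treating "size" as the match length is my assumption.

```python
def ngram_draft(tokens: list[int], ngram_size: int, n_max: int) -> list[int]:
    """Propose up to n_max draft tokens by matching the last ngram_size tokens
    against earlier context and copying what followed the most recent match."""
    if len(tokens) <= ngram_size:
        return []
    key = tokens[-ngram_size:]
    # Scan backwards so the most recent earlier occurrence wins.
    # Starting one position before the suffix avoids matching the key with itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == key:
            return tokens[start + ngram_size:start + ngram_size + n_max]
    return []  # no earlier occurrence, so nothing to draft


# Repetitive sequences (typical for code) yield drafts; novel text yields none.
print(ngram_draft([1, 2, 3, 4, 1, 2], ngram_size=2, n_max=3))
```

The target model still verifies every drafted token, so a bad draft costs time but does not change the output.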
ik_llama + MTP (ubergarm tuned config)
Best narrative speed: 63.9 TPS, the highest in the entire benchmark. Code sits at 89.2 TPS.
- Extra config: `-muge --merge-qkv -mtprot iq4_ks -cram 32768 --slot-save-path /root/slot --ctx-checkpoints 32`
- VRAM: 22304 MiB. Higher VRAM due to slot checkpoints.
- Context degradation: −32% (34.6→23.5). Worst drop across all setups.
ik_llama + MTP (Standard Config)
The baseline for native MTP, run with standard parameters (n_max=2) and without ubergarm's recommended tweaks or the hybrid ngram module. It delivers a balanced 73.1 TPS in code and 61.7 TPS in narrative.
- VRAM: 20208 MiB.
- Context degradation: −25% (33.8→25.4 gen_tps).
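A rough way to reason about the n_max setting: if each drafted token is accepted independently with probability a, the expected number of tokens produced per verification pass with n drafted tokens is (1 − a^(n+1)) / (1 − a). The sketch below evaluates that textbook formula. The acceptance rates are hypothetical, not measured in this benchmark, and the model ignores the cost of producing the drafts.

```python
def expected_tokens_per_step(accept_rate: float, n_max: int) -> float:
    """Expected tokens emitted per target-model pass, assuming each draft token
    is accepted independently with probability accept_rate."""
    if accept_rate >= 1.0:
        # Every draft is accepted, plus the one token the target always adds.
        return float(n_max + 1)
    return (1.0 - accept_rate ** (n_max + 1)) / (1.0 - accept_rate)


# Hypothetical acceptance rates: predictable text (code) vs. less predictable text.
for rate in (0.8, 0.5):
    for n in (2, 4):
        print(f"a={rate}, n_max={n}: {expected_tokens_per_step(rate, n):.2f}")
```

Under this simplification, raising n_max pays off mostly when acceptance is high, which would be consistent with the larger gap between the n_max=2 and n_max=4 configs on code than on narrative.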
ik_llama + DFlash
Tested with beellama's independent draft model. Code reaches 96.8 TPS, competitive with MTP+ngram, but narrative suffers heavily (45.6 TPS). TTFT is high (504ms), possibly because of the separate draft model, though I have not confirmed the cause.
mainline llama.cpp — The Reference
No forks, no patches. Upstream speculative MTP. Standard Q4_K_M quantization.
- Code: 64.7 TPS | Narrative: 52.5 TPS
- TTFT: 288ms — lowest across the board, zero overhead
- Context consistency: 0% degradation (31.3→31.3 TPS between 72k and 128k) in this run. This matters: mainline held its speed regardless of context length, although it may be an outlier, since the summary table records 33.4→31.2 (−6.6%) for the same setup.
It’s not the fastest in raw throughput, but it’s the most predictable.
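For readers who want to reproduce TTFT and decode TPS figures like these, one common way to derive them is from the arrival times of streamed tokens. This is a sketch under that assumption; the club-3090 bench script may compute them differently, and the function name is hypothetical.

```python
def stream_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT (ms) and decode TPS from the arrival times of streamed tokens.

    request_sent and token_times are timestamps in seconds from the same clock,
    for example time.perf_counter().
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    ttft_ms = (token_times[0] - request_sent) * 1000.0
    decode_seconds = token_times[-1] - token_times[0]
    # Tokens after the first one are attributed to the decode phase.
    decode_tps = (len(token_times) - 1) / decode_seconds
    return {"ttft_ms": ttft_ms, "decode_tps": decode_tps}
```

With speculative decoding, several tokens can arrive in one chunk, so timestamps should be recorded per token rather than per network chunk to keep the count honest.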
Spiritbuun — Optimized MTP, Failed DFlash
Spiritbuun MTP
Fork with optimized MTP (turbo cache, flash-attn). Q4_K_M quantization.
I tested this because it gave me the best results with the Qwen 3.6 35B A3B MoE model, paired with APEX quants (see my post about it if you are interested).
- Code: 59.7 TPS | Narrative: 45.7 TPS
- Context degradation: −9%. Best consistency after mainline.
- TTFT: 294ms — nearly identical to mainline
Spiritbuun DFlash
Tested with its own draft model. It failed to reach MTP speeds: 66.9 TPS code, 30.4 TPS narrative. I didn't test long-context performance because it didn't seem worth it.
beellama DFlash — Brutal Code Speed, High TTFT Cost
Uses own draft model (anbeeld-Qwen3.6-27B-DFlash-IQ4_XS.gguf) with cross-ctx 1024 and unified KV.
- Code: 96.8 TPS, the highest code figure in the summary table (ik_llama's best is 89.2)
- Narrative: 45.6 TPS
- Drawback: 504ms TTFT (nearly double mainline). First word takes half a second.
- VRAM: 20814 MiB. Moderate GPU usage (73%).
- Context: 128k holds 27.1 TPS. Better than ik_llama MTP in long context.
LUCEBOX DFlash — Not working for me
Independent server engine with DFlash, TQ3 KV cache, and PFlash!
- Code: 32.6 TPS | Narrative: 32.5 TPS
- Worse than running without speculative decoding in many cases
Maybe I didn't understand how to use it correctly? The environment settings I used in my incus container:
- `environment.DFLASH_FP_USE_BSA: "1"`
- `environment.DFLASH_HOST: 0.0.0.0`
- `environment.DFLASH_KVFLASH: auto`
- `environment.DFLASH_PORT: "8080"`
- `environment.DFLASH_PREFILL_DRAFTER: /opt/lucebox-hub/server/models/unsloth-Qwen3-0.6B-BF16.gguf`
- `environment.DFLASH_PREFILL_MODE: auto`
- `environment.DFLASH_SERVER_BIN: /opt/lucebox-hub/server/build/dflash_server`
- `environment.DFLASH_TARGET: /opt/lucebox-hub/server/models/Qwen3.6-27B-Q4_K_M.gguf`
- `environment.DFLASH27B_KV_TQ3: "1"`

Consistency Verdict
If we rank purely by real-world consistency (speed stability across context lengths + low TTFT + low VRAM overhead):
- mainline llama.cpp MTP — The clear winner for consistency. Almost zero degradation between 72k and 128k. Lowest TTFT (288ms). Stable VRAM (~21GB). No external draft model dependency. It doesn't break, doesn't spike, doesn't throttle.
- Spiritbuun MTP — Only 9% degradation, TTFT 294ms, very stable. Slightly lower throughput than mainline but remarkably predictable.
- LUCEBOX DFlash — Technically consistent (0.1% variance), but consistently slow. Not useful for me.
- ik_llama setups — Fast in short context, but pay a heavy price in long context (−25% to −32% degradation).
My take: The differences between mainline and Spiritbuun are small (~5-7 TPS in the summary table). But mainline's minimal degradation and lowest TTFT make it the most practically consistent setup. If you're running long documents or RAG pipelines, mainline won't surprise you. ik_llama wins on speed, but you're betting on short context.
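The ranking criteria above can be made explicit with a small sort. This is only a sketch: ordering by degradation first, then TTFT, then VRAM is my assumption about the priorities, and the values are copied from the summary table. LUCEBOX and the DFlash rows are omitted because they lack a clean 72k/128k pair.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    name: str
    degradation_pct: float  # 72k -> 128k, negative means slower
    ttft_ms: float
    vram_mib: int


setups = [
    Setup("ik_llama MTP (ubergarm)", -32.1, 361, 22304),
    Setup("ik_llama MTP+ngram", -24.9, 341, 20508),
    Setup("ik_llama MTP (standard)", -24.8, 357, 20208),
    Setup("mainline llama.cpp MTP", -6.6, 288, 21354),
    Setup("Spiritbuun MTP", -9.5, 294, 22066),
]


def consistency_key(s: Setup) -> tuple[float, float, int]:
    # Smaller is better for all three: speed loss first, then TTFT, then VRAM.
    return (abs(s.degradation_pct), s.ttft_ms, s.vram_mib)


ranked = sorted(setups, key=consistency_key)
for place, s in enumerate(ranked, start=1):
    print(place, s.name)
```

A weighted score would be another option, but a lexicographic key keeps the priorities easy to read and to change.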
Final Recommendations
| Priority | Best Option | Why |
|---|---|---|
| Code speed | ik_llama MTP+ngram | 87.8 TPS, double the baseline |
| Narrative speed | ik_llama MTP (ubergarm) | 63.9 TPS |
| Context consistency | mainline llama.cpp | 0% degradation, lowest TTFT |
| Balance speed + stability | Spiritbuun MTP | Near-mainline consistency with slightly lower throughput |
| Low TTFT | mainline llama.cpp | 288ms, zero overhead |
What do you think?