r/LocalLLaMA · June 30, 2026 · 5 min read

Qwen 3.6 27B Speculative Decoding Bench: Pushing ~100 TPS on a single RTX 3090


First of all, a huge thank you to the r/LocalLLaMA community and the 3090 club. This benchmark started from your shared recipes...

These are my findings on my hardware (Xeon E5-2666v3, 64GB RAM, single RTX 3090 24GB) comparing 5 engines (3 llama.cpp forks + mainline + Lucebox) across two quantizations of the same model.

I used the bench script from https://github.com/noonghunna/club-3090/tree/master, plus two simple scripts that build long prompts from enwik8.
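A prompt builder along those lines might look like the sketch below. It is not the script used for these runs: the function name, the 4-characters-per-token estimate, and the closing question are all assumptions for illustration. A real run would read the text from a local copy of the corpus and count tokens with the model's tokenizer.

```python
def build_long_prompt(corpus: str, target_tokens: int,
                      chars_per_token: float = 4.0,
                      question: str = "Summarize the text above in three sentences.") -> str:
    """Return a prompt of roughly target_tokens tokens cut from corpus.

    chars_per_token is a rough assumption, not a measured value.
    """
    budget = int(target_tokens * chars_per_token)
    if budget <= 0:
        raise ValueError("target_tokens must be positive")
    # Repeat the corpus if it is shorter than the requested context.
    repeats = budget // max(len(corpus), 1) + 1
    filler = (corpus * repeats)[:budget]
    # Cut at the last space so the filler does not end mid-word.
    cut = filler.rfind(" ")
    if cut > 0:
        filler = filler[:cut]
    return f"{filler}\n\n{question}"


if __name__ == "__main__":
    sample = "The quick brown fox jumps over the lazy dog. " * 50
    prompt = build_long_prompt(sample, target_tokens=100)
    print(len(prompt), "characters")
```

Because the token count is only estimated from characters, the filled context reported by the server is the number to trust, not target_tokens.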

Summary Table

Sorted by fork → speculative type. Key metrics: decode_TPS (code & narrative), TTFT, VRAM usage, and context consistency (generation speed degradation when moving from 72k to 128k filled context).

Fork / Engine	Speculative Type	Model / Quant	Code TPS	Narr. TPS	TTFT	VRAM (MiB)	Gen 72k	Gen 128k	Deg. (72k→128k)
ik_llama (ubergarm config)	MTP `n_max=4`	Qwen3.6-27B-IQ4_KS	89.2	63.9	361ms	22304	34.6	23.5	−32.1%
ik_llama + ngram	ngram+MTP	Qwen3.6-27B-IQ4_KS	87.8	58.6	341ms	20508	32.1	24.1	−24.9%
ik_llama (Standard config)	MTP `n_max=2`	Qwen3.6-27B-IQ4_KS	73.1	61.7	357ms	20208	33.8	25.4	−24.8%

mainline llama.cpp	MTP `n_max=1`	Qwen3.6-27B-Q4_K_M	64.7	52.5	288ms	21354	33.4	31.2	−6.6%
Spiritbuun	MTP	Qwen3.6-27B-Q4_K_M	59.7	45.7	294ms	22066	34.8	31.5	−9.5%
beellama	DFlash (Draft GGUF)	Qwen3.6-27B-Q4_K_M	96.8	45.6	504ms	20814	22.9*	27.1	−41.3%**
Spiritbuun	DFlash	Qwen3.6-27B-Q4_K_M	66.9	30.4	300ms	23356	—	—	—
LUCEBOX	DFlash (TQ3 KV)	Qwen3.6-27B-Q4_K_M	32.6	32.5	448ms	20680	27.0	—	—
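The degradation column appears to be the relative change in generation speed between the 72k and 128k runs. The sketch below recomputes it for the rows that list both values. The formula is an assumption that matches those rows to within rounding, and the function name is made up for illustration. The beellama row is left out because its footnote markers are not explained.

```python
def degradation_pct(gen_72k: float, gen_128k: float) -> float:
    """Percent change in generation speed going from 72k to 128k context."""
    if gen_72k <= 0:
        raise ValueError("gen_72k must be positive")
    return (gen_128k - gen_72k) / gen_72k * 100.0


# Generation speeds (72k, 128k) copied from the summary table.
rows = {
    "ik_llama MTP (ubergarm)": (34.6, 23.5),
    "ik_llama ngram+MTP": (32.1, 24.1),
    "ik_llama MTP (standard)": (33.8, 25.4),
    "mainline llama.cpp MTP": (33.4, 31.2),
    "Spiritbuun MTP": (34.8, 31.5),
}

for name, (g72, g128) in rows.items():
    print(f"{name:28s} {degradation_pct(g72, g128):+.1f}%")
```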

ik_llama — The fork that does "everything"

Fork of llama.cpp with native MTP support, merge-qkv, recurrent checkpoints, and multi-backend speculative decoding. Tested on IQ4_KS quant (by ubergarm).

ik_llama + MTP+ngram (ngram-mod + mtp)

Very fast code generation. Combines ngram drafts (n_max=4, size 16) with MTP (n_max=3). Code hits 87.8 decode tokens/sec, about 36% above mainline's 64.7.

VRAM: 20508 MiB (82% GPU utilization)
Context degradation: −25% (32.1→24.1 gen_tps). Notable drop when context fills.
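For readers new to ngram drafting, here is a minimal sketch of the general idea: reuse whatever followed the last time the same token sequence appeared, then keep only the drafted tokens the main model agrees with. This is a generic illustration, not ik_llama's implementation, and the function and parameter names are hypothetical.

```python
from typing import List, Sequence


def ngram_draft(history: Sequence[int], ngram_size: int, n_max: int) -> List[int]:
    """Propose up to n_max tokens by reusing what followed the most recent
    earlier occurrence of the last ngram_size tokens in history."""
    if ngram_size <= 0 or len(history) <= ngram_size:
        return []
    key = tuple(history[-ngram_size:])
    # Scan backwards, skipping the trailing occurrence itself.
    for start in range(len(history) - ngram_size - 1, -1, -1):
        if tuple(history[start:start + ngram_size]) == key:
            follow = start + ngram_size
            return list(history[follow:follow + n_max])
    return []


def accepted_prefix(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count how many drafted tokens the target model agrees with, in order."""
    count = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        count += 1
    return count


if __name__ == "__main__":
    history = [1, 2, 3, 4, 1, 2]
    draft = ngram_draft(history, ngram_size=2, n_max=4)
    print("draft:", draft)
    print("accepted:", accepted_prefix(draft, [3, 4, 9, 9]))
```

Code tends to repeat identifiers and boilerplate, so lookups like this hit more often than on prose. That is one plausible reason the code numbers run ahead of the narrative ones, though nothing in this bench isolates that effect.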

ik_llama + MTP (ubergarm tuned config)

Best narrative speed: 63.9 DP (decode tokens/sec), the highest in the entire benchmark. Code sits at 89.2 DP.

Extra config: -muge --merge-qkv -mtprot iq4_ks -cram 32768 --slot-save-path /root/slot --ctx-checkpoints 32
VRAM: 22304 MiB. Higher VRAM due to slot checkpoints.
Context degradation: −32% (34.6→23.5 gen_tps). Worst drop among the MTP setups.

ik_llama + MTP (Standard Config)

The baseline for native MTP, run with standard parameters (n_max=2) and without ubergarm's recommended tweaks or the hybrid ngram module. It delivers a balanced 73.1 DP in code and 61.7 DP in narrative.

VRAM: 20208 MiB.
Context degradation: −25% (33.8→25.4 gen_tps).
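One idealized way to think about the n_max setting: if each drafted token is accepted independently with the same probability, the expected number of tokens produced per main-model pass is a geometric series in that probability. The sketch below computes it. The independence assumption and the acceptance rates are illustrative only, and none of these values were measured in this bench.

```python
def expected_tokens_per_step(accept_rate: float, n_max: int) -> float:
    """Expected tokens emitted per target-model pass when each drafted token
    is accepted independently with probability accept_rate (an idealization)."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if n_max < 0:
        raise ValueError("n_max must be non-negative")
    if accept_rate == 1.0:
        return float(n_max + 1)
    # Geometric series: 1 + a + a^2 + ... + a^n_max
    return (1.0 - accept_rate ** (n_max + 1)) / (1.0 - accept_rate)


# Hypothetical acceptance rates, chosen only to show the shape of the curve.
for label, rate in (("predictable text", 0.8), ("less predictable text", 0.5)):
    for n_max in (2, 4):
        print(f"{label}, n_max={n_max}: {expected_tokens_per_step(rate, n_max):.2f}")
```

Under this model a longer draft pays off mostly when acceptance is high. That is at least consistent with n_max=4 helping code (73.1 to 89.2 DP) far more than narrative (61.7 to 63.9 DP), although the ubergarm config changes other flags as well.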

ik_llama + DFlash

Tested with beellama's independent draft model. Code reaches 96.8 DP, ahead of MTP+ngram (87.8 DP), but narrative drops to 45.6 DP. TTFT is high (504ms), possibly because of the separate draft model.

mainline llama.cpp — The Reference

No forks, no patches. Upstream speculative MTP. Standard Q4_K_M quantization.

Code: 64.7 DP | Narrative: 52.5 DP
TTFT: 288ms — lowest across the board, zero overhead
Context consistency: −6.6% degradation (33.4→31.2 gen_tps between 72k and 128k), the smallest drop in the table. This matters: mainline holds its speed better than any other setup as context grows (though a single run could be an outlier).

It’s not the fastest in raw throughput, but it’s the most predictable.
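For reference, TTFT and decode speed can both be derived from per-token arrival times on a streaming response. The sketch below shows one common definition. It is an assumption about how such numbers are usually computed, not a description of the club-3090 bench script, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def ttft_and_decode_tps(request_time: float,
                        token_times: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT in ms, decode tokens/sec) from wall-clock timestamps in seconds.

    token_times[i] is when the i-th generated token arrived.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    ttft_ms = (token_times[0] - request_time) * 1000.0
    decode_window = token_times[-1] - token_times[0]
    if decode_window <= 0:
        raise ValueError("token timestamps must be increasing")
    # Tokens after the first one, divided by the time spent producing them.
    decode_tps = (len(token_times) - 1) / decode_window
    return ttft_ms, decode_tps


if __name__ == "__main__":
    ttft, tps = ttft_and_decode_tps(0.0, [0.25, 0.5, 0.75, 1.0])
    print(f"TTFT {ttft:.0f} ms, decode {tps:.1f} tok/s")
```

Keeping prefill out of the decode window is what lets a setup show a high TTFT and a high decode rate at the same time.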

Spiritbuun — Optimized MTP, Failed DFlash

Spiritbuun MTP

Fork with optimized MTP (turbo cache, flash-attn). Q4_K_M quantization.

I tested this because it gave me the best results with the Qwen 3.6 35B A3B MoE model, paired with APEX quants (see my post about it if you are interested).

Code: 59.7 DP | Narrative: 45.7 DP
Context degradation: −9%. Best consistency after mainline.
TTFT: 294ms — nearly identical to mainline

Spiritbuun DFlash

Tested with its own draft model. Code reached 66.9 DP, above this fork's MTP run (59.7 DP), but narrative fell to 30.4 DP against 45.7 DP with MTP. I didn't test long-context performance because it didn't seem worth it.

beellama DFlash — Brutal Code Speed, High TTFT Cost

Uses own draft model (anbeeld-Qwen3.6-27B-DFlash-IQ4_XS.gguf) with cross-ctx 1024 and unified KV.

Code: 96.8 DP, the highest code figure in the summary table (ik_llama's best is 89.2 DP)
Narrative: 45.6 DP
Drawback: 504ms TTFT, about 1.75x mainline's 288ms. The first token takes half a second.
VRAM: 20814 MiB. Moderate GPU usage (73%).
Context: 128k holds 27.1 DP. Better than ik_llama MTP in long context.

LUCEBOX DFlash — Not working for me

Independent server engine with DFlash, TQ3 KV cache, and PFlash!

Code: 32.6 DP | Narrative: 32.5 DP
Worse than running without speculative decoding in many cases

Maybe I didn't understand how to use it properly? These are the environment variables I set in my Incus container:

 environment.DFLASH_FP_USE_BSA: "1"
 environment.DFLASH_HOST: 0.0.0.0
 environment.DFLASH_KVFLASH: auto
 environment.DFLASH_PORT: "8080"
 environment.DFLASH_PREFILL_DRAFTER: /opt/lucebox-hub/server/models/unsloth-Qwen3-0.6B-BF16.gguf
 environment.DFLASH_PREFILL_MODE: auto
 environment.DFLASH_SERVER_BIN: /opt/lucebox-hub/server/build/dflash_server
 environment.DFLASH_TARGET: /opt/lucebox-hub/server/models/Qwen3.6-27B-Q4_K_M.gguf
 environment.DFLASH27B_KV_TQ3: "1"

Consistency Verdict

If we rank purely by real-world consistency (speed stability across context lengths + low TTFT + low VRAM overhead):

mainline llama.cpp MTP: the clear winner for consistency. Smallest degradation between 72k and 128k (−6.6%), lowest TTFT (288ms), stable VRAM (~21GB), and no external draft model dependency. It doesn't break, doesn't spike, doesn't throttle.
Spiritbuun MTP: −9.5% degradation, TTFT 294ms, very stable. Slightly lower throughput than mainline but remarkably predictable.
LUCEBOX DFlash: technically consistent (code and narrative within about 0.1 DP of each other), but consistently slow. Not useful for me.
ik_llama setups: fast in short context, but they pay a heavy price in long context (−25% to −32% degradation).
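The ordering can be reproduced mechanically from the summary table. The sketch below sorts the setups that have both 72k and 128k numbers by smallest slowdown, then TTFT, then VRAM. That priority order is an assumption about how to weigh the three criteria, and the rows without 128k data (plus the footnoted beellama row) are left out.

```python
# (name, degradation % from 72k to 128k, TTFT ms, VRAM MiB), copied from the summary table.
setups = [
    ("ik_llama MTP (ubergarm)", -32.1, 361, 22304),
    ("ik_llama ngram+MTP", -24.9, 341, 20508),
    ("ik_llama MTP (standard)", -24.8, 357, 20208),
    ("mainline llama.cpp MTP", -6.6, 288, 21354),
    ("Spiritbuun MTP", -9.5, 294, 22066),
]


def consistency_key(row):
    """Sort key: smallest slowdown first, then lowest TTFT, then lowest VRAM."""
    _, deg_pct, ttft_ms, vram_mib = row
    return (abs(deg_pct), ttft_ms, vram_mib)


ranking = sorted(setups, key=consistency_key)
for place, (name, deg, ttft, vram) in enumerate(ranking, start=1):
    print(f"{place}. {name}: {deg:+.1f}%, TTFT {ttft} ms, VRAM {vram} MiB")
```

A different weighting, for example putting VRAM first, would reorder the ik_llama rows but not the top two.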

My take: mainline and Spiritbuun are close (~5-7 DP apart, in mainline's favor). Mainline's minimal degradation and lowest TTFT make it the most practically consistent setup. If you're running long documents or RAG pipelines, mainline won't surprise you. ik_llama wins on speed, but you're betting on short context.

Final Recommendations

Priority	Best Option	Why
Code speed	ik_llama MTP+ngram	87.8 DP, double the baseline
Narrative speed	ik_llama MTP (ubergarm)	63.9 DP
Context consistency	mainline llama.cpp	−6.6% degradation (smallest), lowest TTFT
Balance speed + stability	Spiritbuun MTP	Near-mainline consistency at slightly lower throughput
Low TTFT	mainline llama.cpp	288ms, zero overhead

What do you think?

submitted by /u/old-mike