r/LocalLLaMA · · 5 min read

Qwen 3.6 27B Speculative Decoding Bench: Pushing ~100 TPS on a single RTX 3090

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

First of all, a huge thank you to the r/LocalLLaMA community and the 3090 club. This benchmark started from your shared recipes...

These are my findings on my hardware (Xeon E5-2666v3, 64GB RAM, single RTX 3090 24GB) comparing 5 engines (3 llama.cpp forks + mainline + Lucebox) across two quantizations of the same model.

I used the bench script from https://github.com/noonghunna/club-3090/tree/master plus two simple scripts that build long prompts from enwik8.
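The two prompt-building scripts aren't reproduced in this post. As a rough sketch of the idea only: cut a slice of the corpus sized to a token budget and append an instruction. The function name, the 4-characters-per-token estimate, and the closing instruction are all illustrative assumptions, not the actual scripts.

```python
def build_long_prompt(corpus: str, target_tokens: int,
                      chars_per_token: float = 4.0) -> str:
    """Cut a prompt of roughly target_tokens tokens from the start of corpus.

    chars_per_token is a crude assumed ratio; a real run should count
    tokens with the model's own tokenizer.
    """
    budget = int(target_tokens * chars_per_token)
    chunk = corpus[:budget]
    # If the corpus was truncated, trim back to the last space so the
    # slice does not end in the middle of a word.
    if len(corpus) > budget:
        cut = chunk.rfind(" ")
        if cut > 0:
            chunk = chunk[:cut]
    # Hypothetical closing instruction so the model has something to generate.
    return chunk + "\n\nSummarize the text above in three sentences."


# Example with synthetic text; for a real run, read the corpus file instead.
demo = build_long_prompt("lorem ipsum " * 10_000, target_tokens=1_000)
print(len(demo))
```

For the 72k and 128k runs the same function would simply be called with a larger `target_tokens`, then checked against the server's reported prompt token count.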

Summary Table

Sorted by fork → speculative type. Key metrics: decode_TPS (code & narrative), TTFT, VRAM usage, and context consistency (generation speed degradation when moving from 72k to 128k filled context).

| Fork / Engine | Speculative Type | Model / Quant | Code TPS | Narr. TPS | TTFT | VRAM (MiB) | Gen 72k | Gen 128k | Deg. (72k→128k) |
|---|---|---|---|---|---|---|---|---|---|
| ik_llama (ubergarm config) | MTP n_max=4 | Qwen3.6-27B-IQ4_KS | 89.2 | 63.9 | 361ms | 22304 | 34.6 | 23.5 | −32.1% |
| ik_llama + ngram | ngram+MTP | Qwen3.6-27B-IQ4_KS | 87.8 | 58.6 | 341ms | 20508 | 32.1 | 24.1 | −24.9% |
| ik_llama (Standard config) | MTP n_max=2 | Qwen3.6-27B-IQ4_KS | 73.1 | 61.7 | 357ms | 20208 | 33.8 | 25.4 | −24.8% |
| mainline llama.cpp | MTP n_max=1 | Qwen3.6-27B-Q4_K_M | 64.7 | 52.5 | 288ms | 21354 | 33.4 | 31.2 | −6.6% |
| Spiritbuun | MTP | Qwen3.6-27B-Q4_K_M | 59.7 | 45.7 | 294ms | 22066 | 34.8 | 31.5 | −9.5% |
| beellama | DFlash (Draft GGUF) | Qwen3.6-27B-Q4_K_M | 96.8 | 45.6 | 504ms | 20814 | 22.9* | 27.1 | −41.3%** |
| Spiritbuun | DFlash | Qwen3.6-27B-Q4_K_M | 66.9 | 30.4 | 300ms | 23356 | | | |
| LUCEBOX | DFlash (TQ3 KV) | Qwen3.6-27B-Q4_K_M | 32.6 | 32.5 | 448ms | 20680 | 27.0 | | |
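The degradation column is just the relative change in generation speed between the 72k and 128k runs. A small sketch that recomputes it from the rounded table values (so results can differ from the table by about 0.1 points). The beellama row is left out because its starred figures do not reproduce from the two Gen columns.

```python
def degradation_pct(gen_tps_72k: float, gen_tps_128k: float) -> float:
    """Relative change in generation TPS from 72k to 128k context, in percent.

    Negative means the engine got slower as the context filled up.
    """
    return (gen_tps_128k - gen_tps_72k) / gen_tps_72k * 100.0


# (Gen 72k, Gen 128k) pairs copied from the summary table.
rows = {
    "ik_llama MTP (ubergarm)": (34.6, 23.5),
    "ik_llama ngram+MTP": (32.1, 24.1),
    "ik_llama MTP (standard)": (33.8, 25.4),
    "mainline llama.cpp MTP": (33.4, 31.2),
    "Spiritbuun MTP": (34.8, 31.5),
}

for name, (tps_72k, tps_128k) in rows.items():
    print(f"{name}: {degradation_pct(tps_72k, tps_128k):+.1f}%")
```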

ik_llama — The fork that does "everything"

Fork of llama.cpp with native MTP support, merge-qkv, recurrent checkpoints, and multi-backend speculative decoding. Tested on IQ4_KS quant (by ubergarm).

ik_llama + MTP+ngram (ngram-mod + mtp)

Great code generation. Combines ngram drafts (n_max=4, size 16) with MTP (n_max=3). Code hits 87.8 decode tokens/sec, about 36% above mainline's 64.7.

  • VRAM: 20508 MiB (82% GPU utilization)
  • Context degradation: −25% (32.1→24.1 gen_tps). Notable drop when context fills.
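For anyone unfamiliar with ngram drafting, here is a toy sketch of the general idea: look for the most recent earlier occurrence of the last few tokens and propose whatever followed it as the draft. This is a generic illustration, not ik_llama's implementation; `ngram_size` and both function names are made up for the example, and the "size 16" setting from the config is not modeled.

```python
def ngram_draft(tokens: list[int], ngram_size: int = 3, n_max: int = 4) -> list[int]:
    """Propose up to n_max draft tokens by copying what followed the most
    recent earlier occurrence of the last ngram_size tokens."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Walk backwards over earlier positions, skipping the tail itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            return tokens[start + ngram_size:start + ngram_size + n_max]
    return []


def accepted_prefix(draft: list[int], verified: list[int]) -> int:
    """Count how many leading draft tokens match what the target model produced."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n


history = [1, 2, 3, 4, 5, 6, 1, 2, 3]    # the sequence 1,2,3 repeats
draft = ngram_draft(history)              # copies the tokens that followed it earlier
print(draft, accepted_prefix(draft, [4, 5, 9, 9]))
```

Code tends to repeat identifiers and boilerplate, which is plausibly why this kind of lookup helps code far more than narrative text in the numbers above.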

ik_llama + MTP (ubergarm tuned config)

Best narrative speed: 63.9 TPS, highest in the entire benchmark. Code sits at 89.2 TPS.

  • Extra config: -muge --merge-qkv -mtprot iq4_ks -cram 32768 --slot-save-path /root/slot --ctx-checkpoints 32
  • VRAM: 22304 MiB. Higher VRAM due to slot checkpoints.
  • Context degradation: −32% (34.6→23.5 gen_tps). Worst drop among the MTP setups.

ik_llama + MTP (Standard Config)

The baseline for native MTP: standard parameters (n_max=2), without ubergarm's recommended tweaks or the hybrid ngram module. It delivers a balanced 73.1 TPS in code and 61.7 TPS in narrative.

  • VRAM: 20208 MiB.
  • Context degradation: −25% (33.8→25.4 gen_tps).

ik_llama + DFlash

Tested with beellama's independent draft model. Code reaches 96.8 TPS, competitive with MTP+ngram, but narrative suffers heavily (45.6 TPS). TTFT is high (504ms), possibly because of the separate draft model, though I haven't confirmed the cause.

mainline llama.cpp — The Reference

No forks, no patches. Upstream speculative MTP. Standard Q4_K_M quantization.

  • Code: 64.7 TPS | Narrative: 52.5 TPS
  • TTFT: 288ms, the lowest across the board
  • Context consistency: 0% degradation (31.3→31.3 TPS between 72k and 128k) in this run, though the summary table lists 33.4→31.2 (−6.6%), so the flat result may be an outlier. Either way, mainline holds its speed across context lengths better than any other setup here.

It’s not the fastest in raw throughput, but it’s the most predictable.

Spiritbuun — Optimized MTP, Failed DFlash

Spiritbuun MTP

Fork with optimized MTP (turbo cache, flash-attn). Q4_K_M quantization.

I tested this because it gave me the best results with the Qwen 3.6 35B A3B MoE model, paired with APEX quants (see my post about it if you are interested).

  • Code: 59.7 TPS | Narrative: 45.7 TPS
  • Context degradation: −9%. Best consistency after mainline.
  • TTFT: 294ms — nearly identical to mainline

Spiritbuun DFlash

Tested with its own draft model. It failed to reach MTP speeds: 66.9 TPS code, 30.4 TPS narrative. I didn't test long-context performance; it didn't seem worth it.

beellama DFlash — Brutal Code Speed, High TTFT Cost

Uses its own draft model (anbeeld-Qwen3.6-27B-DFlash-IQ4_XS.gguf) with cross-ctx 1024 and unified KV.

  • Code: 96.8 TPS, second best overall and very close to ik_llama
  • Narrative: 45.6 TPS
  • Drawback: 504ms TTFT (about 1.75x mainline's 288ms). The first token takes half a second.
  • VRAM: 20814 MiB. Moderate GPU usage (73%).
  • Context: 128k holds 27.1 TPS. Better than ik_llama MTP in long context.
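To put the TTFT cost in perspective, here is a back-of-the-envelope sketch of how long a reply has to be before beellama's faster decode makes up for its slower first token. It assumes a simple model where total latency is TTFT plus tokens divided by decode TPS, using the short-prompt numbers from the summary table; real latency also depends on prompt length, so treat this as a rough guide.

```python
def total_latency_s(ttft_ms: float, decode_tps: float, n_tokens: int) -> float:
    """Assumed latency model: time to first token plus steady-state decode time."""
    return ttft_ms / 1000.0 + n_tokens / decode_tps


def break_even_tokens(ttft_a_ms: float, tps_a: float,
                      ttft_b_ms: float, tps_b: float) -> float:
    """Output length at which setup A (higher TTFT) catches up with setup B.

    Returns infinity when A does not decode faster than B, because it can
    then never recover its TTFT deficit.
    """
    if tps_a <= tps_b:
        return float("inf")
    return ((ttft_a_ms - ttft_b_ms) / 1000.0) / (1.0 / tps_b - 1.0 / tps_a)


# beellama DFlash vs mainline MTP, code workload (values from the summary table).
code_n = break_even_tokens(504, 96.8, 288, 64.7)
# Same comparison on the narrative workload, where beellama decodes slower.
narr_n = break_even_tokens(504, 45.6, 288, 52.5)
print(f"code break-even: {code_n:.0f} tokens, narrative break-even: {narr_n}")
```

Under these assumptions the code workload breaks even after a few dozen output tokens, so the high TTFT only hurts for very short completions. On narrative text beellama never catches up, since its decode speed is below mainline's.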

LUCEBOX DFlash — Not working for me

Independent server engine with DFlash, TQ3 KV cache, and PFlash!

  • Code: 32.6 TPS | Narrative: 32.5 TPS
  • Worse than running without speculative decoding in many cases

Maybe I didn't understand how to use it properly? These are the environment variables I set in my incus container:

  environment.DFLASH_FP_USE_BSA: "1"
  environment.DFLASH_HOST: 0.0.0.0
  environment.DFLASH_KVFLASH: auto
  environment.DFLASH_PORT: "8080"
  environment.DFLASH_PREFILL_DRAFTER: /opt/lucebox-hub/server/models/unsloth-Qwen3-0.6B-BF16.gguf
  environment.DFLASH_PREFILL_MODE: auto
  environment.DFLASH_SERVER_BIN: /opt/lucebox-hub/server/build/dflash_server
  environment.DFLASH_TARGET: /opt/lucebox-hub/server/models/Qwen3.6-27B-Q4_K_M.gguf
  environment.DFLASH27B_KV_TQ3: "1"

Consistency Verdict

If we rank purely by real-world consistency (speed stability across context lengths + low TTFT + low VRAM overhead):

  1. mainline llama.cpp MTP — The clear winner for consistency. Almost zero degradation between 72k and 128k. Lowest TTFT (288ms). Stable VRAM (~21GB). No external draft model dependency. It doesn't break, doesn't spike, doesn't throttle.
  2. Spiritbuun MTP — Only 9% degradation, TTFT 294ms, very stable. Slightly lower throughput than mainline but remarkably predictable.
  3. LUCEBOX DFlash — Technically consistent (0.1% variance), but consistently slow. Not useful for me.
  4. ik_llama setups — Fast in short context, but pay a heavy price in long context (−25% to −32% degradation).

My take: The differences between mainline and Spiritbuun are small (~5-7 TPS). But mainline's near-zero degradation and lowest TTFT make it the most practically consistent setup. If you're running long documents or RAG pipelines, mainline won't surprise you. ik_llama wins on speed, but you're betting on short context.

Final Recommendations

| Priority | Best Option | Why |
|---|---|---|
| Code speed | ik_llama MTP+ngram | 98.5 TPS, double the baseline |
| Narrative speed | ik_llama MTP (ubergarm) | 63.9 TPS |
| Context consistency | mainline llama.cpp | 0% degradation, lowest TTFT |
| Balance speed + stability | Spiritbuun MTP | Near-mainline consistency with slightly lower throughput |
| Low TTFT | mainline llama.cpp | 288ms, zero overhead |

What do you think?

submitted by /u/old-mike