r/LocalLLaMA · · 1 min read

Does quantizing change the MTP draft rate?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Does quantizing change the MTP draft rate?

Speculative decoding speeds up LLM generation by using a small "drafter" model to predict several tokens ahead of the main model. The main model then verifies these predictions in a single forward pass. If the main model is heavily quantized (low bit-rate), it becomes less "consistent" with the drafter, lowering the acceptance rate.

Models used:

Acceptance rate across quantization levels are tested as a function of draft depths (n), and reported with mean ± 1σ over 3 reps (5 mixed coding/reasoning prompts × 200 tokens, temperature=0.3, thinking off, distinct seeds per rep):

Quant n=1 n=2 n=3 n=4
Q5_K_S 88.5 ±1.0% 81.9 ±0.3% 74.2 ±0.9% 66.7 ±0.5%
IQ4_XS 86.7 ±0.1% 80.3 ±0.9% 72.3 ±0.5% 65.2 ±0.9%
IQ3_M 86.8 ±0.9% 78.3 ±0.2% 71.7 ±1.6% 65.0 ±2.0%
IQ2_M 84.5 ±0.5% 76.7 ±2.5% 69.3 ±1.5% 61.2 ±2.0%

Takeaways. Acceptance rates decline as draft depth increases across all quantization levels. While Q5_K_S provides the highest fidelity, IQ4_XS and IQ3_M perform nearly identically, and even the 2-bit IQ2_M maintains high acceptance for single-token drafts. The speed up associated with these draft levels is very hardware and architecture dependent, the biggest gains come from using n=2 on a cuda device while apple metal only marginally benefits from n=1.

Try it yourself: Download the weights, all you need is ~12 Gb of memory to run the 31B trunk at IQ2_M. Or ~24 Gb if you want to run Q5_K_S with vision capabilities and MTP support.

Run it via llama-server:

llama-server -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ4_XS \ --spec-type draft-mtp --spec-draft-n-max 2 
submitted by /u/professormunchies
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA