r/LocalLLaMA · June 27, 2026 · 1 min read

Does quantizing change the MTP draft rate?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Does quantizing change the MTP draft rate?

Speculative decoding speeds up LLM generation by using a small "drafter" model to predict several tokens ahead of the main model. The main model then verifies these predictions in a single forward pass. If the main model is heavily quantized (low bit-rate), it becomes less "consistent" with the drafter, lowering the acceptance rate.

Models used:

Trunk: Gemma 4-31B-it (quantized GGUFs)
Drafter: Gemma 4-31B-it-assistant (MTP drafter)

Acceptance rate across quantization levels are tested as a function of draft depths (n), and reported with mean ± 1σ over 3 reps (5 mixed coding/reasoning prompts × 200 tokens, temperature=0.3, thinking off, distinct seeds per rep):

Quant	n=1	n=2	n=3	n=4
Q5_K_S	88.5 ±1.0%	81.9 ±0.3%	74.2 ±0.9%	66.7 ±0.5%
IQ4_XS	86.7 ±0.1%	80.3 ±0.9%	72.3 ±0.5%	65.2 ±0.9%
IQ3_M	86.8 ±0.9%	78.3 ±0.2%	71.7 ±1.6%	65.0 ±2.0%
IQ2_M	84.5 ±0.5%	76.7 ±2.5%	69.3 ±1.5%	61.2 ±2.0%

Takeaways. Acceptance rates decline as draft depth increases across all quantization levels. While Q5_K_S provides the highest fidelity, IQ4_XS and IQ3_M perform nearly identically, and even the 2-bit IQ2_M maintains high acceptance for single-token drafts. The speed up associated with these draft levels is very hardware and architecture dependent, the biggest gains come from using n=2 on a cuda device while apple metal only marginally benefits from n=1.

Try it yourself: Download the weights, all you need is ~12 Gb of memory to run the 31B trunk at IQ2_M. Or ~24 Gb if you want to run Q5_K_S with vision capabilities and MTP support.

Run it via llama-server:

llama-server -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ4_XS \ --spec-type draft-mtp --spec-draft-n-max 2

submitted by /u/professormunchies
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA