r/LocalLLaMA · May 29, 2026 · 4 min read

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6: up to 3.34x faster inference. Here are my findings on an RTX PRO 6000.

Hey guys,

I spent the last few weeks benchmarking Multi-Token Prediction (MTP) on Gemma 4 31B and Qwen 3.6 27B locally, in GGUF and FP8 formats, using both vLLM and llama.cpp. MTP is the inference trick major labs are quietly adding to their stacks right now, and the results genuinely surprised me.

Benchmark config:

- 10 runs per session

- 1500 tokens per run

- Sequential mode on vLLM, as I couldn't fully load two models at once

- Same prompt across all runs

- Prefix caching OFF
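For anyone who wants to reproduce a setup like this, here is a minimal sketch of the measurement loop. It is not the script from the repo: `benchmark` and `fake_generate` are hypothetical names, requests are assumed to be issued one at a time, and the `generate` callable is a stand-in you would replace with a real call to your engine that returns the number of tokens it produced.

```python
import statistics
import time
from typing import Callable, List


def benchmark(generate: Callable[[str, int], int],
              prompt: str,
              runs: int = 10,
              max_tokens: int = 1500) -> List[float]:
    """Return tokens/second for each run, sending one request at a time."""
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_tokens)  # tokens actually generated
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed)
    return rates


def fake_generate(prompt: str, max_tokens: int) -> int:
    # Hypothetical stand-in for a real engine call (vLLM, llama.cpp, ...).
    time.sleep(0.001)
    return max_tokens


rates = benchmark(fake_generate, "Explain speculative decoding.")
print(f"mean {statistics.mean(rates):.1f} tok/s, "
      f"stdev {statistics.stdev(rates):.1f} over {len(rates)} runs")
```

Reporting the spread next to the mean makes it easier to tell a real difference between settings from run-to-run noise.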

Models used:

- unsloth/Qwen3.6-27B-MTP-GGUF (Q8_0) via llama.cpp

- RedHatAI/gemma-4-31B-it-FP8-block via vLLM

- Qwen/Qwen3.6-27B-FP8 via vLLM

Hardware: AMD Ryzen 9 9950X | NVIDIA RTX PRO 6000 Blackwell | 96GB VRAM | 92GB RAM | CUDA 13.1 | Ubuntu 24.04

Here is the full leaderboard from my runs:

https://preview.redd.it/3seyqbmi754h1.png?width=1440&format=png&auto=webp&s=23aaf1bc4cd190d4f49a06f03b62018bb90dbdc0

Best result: 132.52 vs 39.69 tok/s = 3.34x faster.

On quality degradation: I did not do a deep evaluation due to time constraints. However, based on studying the architecture, the design makes it hard to degrade quality: the target model still verifies every token before accepting it, so the output path is the same as standard decoding.

On VRAM difference: I tried to capture it but ran out of time for a proper measurement. From a quick spot check it looked negligible, which also aligns with the architecture since the draft model is tiny (76M parameters on Gemma 4).

I would not claim either of these as confirmed. Take them as directional observations, not benchmarked facts.

Here are my 5 biggest findings:

1. vLLM beats llama.cpp for MTP on Gemma 4 — but llama.cpp is solid on Qwen

vLLM hit 132.52 tok/s on Gemma 4 with n=5. llama.cpp peaked at 117.70 tok/s on Qwen 3.6 Q8 with n_max=3. Important caveat: llama.cpp does NOT support Gemma 4 MTP yet, so this is not an apples-to-apples comparison between engines. The vLLM implementation is also more mature right now, since MTP support was added to llama.cpp more recently.

2. Optimal speculative token count is NOT always the highest

For vLLM + Gemma 4: n=5 was best (132.52 tok/s)

For llama.cpp + Qwen 3.6: n=3 was the sweet spot (117.70 tok/s), then performance oscillated at n=4 and n=5. More speculative tokens do not automatically mean more speed. There is a sweet spot per model and engine combination, so you need to benchmark it yourself. How well the draft guesses can also vary with your prompt, so test a few prompts and average the results.
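A rough sketch of how such a sweep could be summarized: average the throughput over several prompts for each draft-token count and pick the best mean. The function name `best_draft_count` is hypothetical, and the numbers in `sweep` are made up for illustration. They are not taken from the benchmark in this post.

```python
from statistics import mean
from typing import Dict, List


def best_draft_count(results: Dict[int, List[float]]) -> int:
    """results maps draft-token count -> tok/s measured on each prompt."""
    return max(results, key=lambda n: mean(results[n]))


# Made-up tok/s values (three prompts per setting), for illustration only.
sweep = {
    1: [70.0, 64.0, 75.0],
    2: [95.0, 88.0, 101.0],
    3: [110.0, 99.0, 118.0],
    4: [104.0, 92.0, 117.0],
    5: [108.0, 90.0, 112.0],
}

best = best_draft_count(sweep)
for n, rates in sweep.items():
    print(f"n={n}: mean {mean(rates):.1f} tok/s")
print(f"best n = {best}")
```

In this toy data a single prompt could point to a different n than the average does, which is the reason to test more than one prompt.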

3. Dense models are where MTP gains are supposed to be biggest

I tested MTP on both Gemma 4 31B and Qwen 3.6 27B because dense models are often the cleanest place to measure speculative decoding gains. In my tests, Gemma 4 reached a 3.34x speedup, while Qwen 3.6 on vLLM reached a 2.59x speedup. I would not frame this as a universal rule, but I ran these tests on dense models because they are supposed to deliver the clearest gains. The reason is architectural: dense models have a more uniform forward pass, which can make the draft-and-verify path easier to optimize and more predictable. As always, it depends on the whole model architecture.

4. The decode phase is memory bandwidth bound — not compute bound

This is one of the reasons MTP can work so well.

During autoregressive decoding, the model usually generates one token at a time. For each new token, the runtime has to run another target-model step and move large amounts of data through GPU memory. In many low-batch inference workloads, the bottleneck is not that the GPU lacks raw compute. The bottleneck is that the system spends a lot of time moving model weights and KV-cache data through memory for every decoding step.

MTP helps by drafting several likely next tokens and letting the target model verify them together. When the draft tokens are accepted, the system can make progress by more than one token from a single verification pass. In other words, MTP does not remove the memory bandwidth cost, but it can amortize that cost across multiple accepted tokens.
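To make the draft-and-verify idea concrete, here is a toy sketch under heavy assumptions: greedy decoding only, integer "tokens", and `target` and `draft` as plain Python functions (all hypothetical). Real MTP heads share the target model's hidden states, and a real engine scores all drafted positions in one batched forward pass. Here the target is simply called once per position, and each loop iteration counts as one verification pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int) -> Tuple[List[int], int]:
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        # Cheap draft: propose k tokens, each conditioned on the previous guesses.
        ctx, proposal = list(out), []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        passes += 1  # stands in for one batched target forward pass
        for guess in proposal:
            expected = target(out)
            out.append(expected)      # always keep the target's own token
            if expected != guess:
                break                 # first mismatch: drop the rest of the draft
        else:
            out.append(target(out))   # every guess accepted: bonus token
    return out[len(prompt):len(prompt) + n_new], passes


def target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + len(ctx)) % 11


def draft(ctx: List[int]) -> int:
    t = target(ctx)
    return t if len(ctx) % 5 else (t + 1) % 11  # deliberately wrong every 5th position


tokens, passes = speculative_decode(target, draft, [1], n_new=20, k=3)
print(f"{len(tokens)} tokens in {passes} verification passes")
print("matches plain greedy decoding:", tokens == greedy_decode(target, [1], 20))
```

Because every emitted token is the target's own choice, the output matches plain greedy decoding while needing fewer target passes. With sampling instead of greedy decoding, engines use a rejection-sampling acceptance rule rather than an exact match, which this sketch does not cover.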

That is why the speedup depends heavily on acceptance rate. If the draft path predicts well, the target model can accept more tokens per pass and decoding becomes faster. If the draft path predicts poorly, fewer tokens are accepted and the speedup becomes smaller.
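A back-of-the-envelope way to see this: if you assume each draft token is accepted independently with probability `alpha` (a simplification, since real acceptance varies by position and prompt), the expected number of tokens produced per verification pass with `k` draft tokens is (1 - alpha^(k+1)) / (1 - alpha). The helper name below is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens per target pass, assuming i.i.d. acceptance probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.7, 0.9):
    row = ", ".join(f"k={k}: {expected_tokens_per_pass(alpha, k):.2f}" for k in (1, 3, 5))
    print(f"alpha={alpha}: {row}")
```

This is an upper bound on the speedup, not a prediction: drafting and verifying extra positions has its own cost, which is one plausible reason why throughput can flatten or drop as the draft count grows.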

5. Inference speed = money, not just UX

If you are serving LLMs in production, 3x faster inference means 3x more users on the same hardware or 3x lower compute cost for the same load. Training burns money. Inference prints it — or bleeds it if you are not optimized. This is why vLLM and llama.cpp both rushed to add MTP support.
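As a rough illustration of the cost argument, the sketch below converts throughput into cost per million generated tokens. The GPU price is a hypothetical placeholder, and the calculation assumes the single-stream numbers from this post (39.69 vs 132.52 tok/s) carry over to your serving setup, which may not hold under heavy batching.

```python
def cost_per_million_tokens(tokens_per_second: float, gpu_usd_per_hour: float) -> float:
    """USD per one million generated tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


GPU_USD_PER_HOUR = 2.00  # hypothetical price, replace with your own

baseline = cost_per_million_tokens(39.69, GPU_USD_PER_HOUR)
with_mtp = cost_per_million_tokens(132.52, GPU_USD_PER_HOUR)
print(f"baseline ${baseline:.2f}/M tokens, with MTP ${with_mtp:.2f}/M tokens, "
      f"ratio {baseline / with_mtp:.2f}x")
```

The ratio is independent of the price you plug in. Only the absolute dollar figures change.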

📦 Resources:

GitHub: full setup with Docker configs, benchmark scripts, and CSV results. There is also a video where I explain the architecture and the idea.

https://github.com/lukaLLM/llamacpp-vllm-mtp-setup-and-speed-benchmark-qwen3.6-gemma4

Let me know what hardware you are running MTP on, which other inference speedups you found useful, or what your findings were!
