MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it
I have an Asus gaming laptop from 2021 that I bought used for 500€ last year. I wanted to see if the recently merged MTP support in llama.cpp is worth using on such a VRAM-constrained device for the Qwen3.6-35B-A3B model. So I did some experiments to measure performance with and without MTP.
TL;DR: It's not worth it. Prompt processing (PP) is so much slower with MTP that it outweighs the minimal gains in token generation (TG) speed. However, I did discover a useful VRAM-saving trick: q4_0 quantization for the draft KV cache works just as well as q8_0 and saves a small bit of VRAM.
Hardware
- Asus ROG Zephyrus G14 laptop, 2021 model
- AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
- NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
- 24GB RAM (DDR4 3200 MT/s), 1TB SSD
Software
- Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
- llama.cpp version: 9198 (a6d6183db) built from current master branch with GNU 13.3.0 for Linux x86_64
- CUDA 12.0 installed from Ubuntu repositories
Test setup
I fixed the following parameters for all the experiments:
- Unsloth Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL model (about the largest this system can run; I used the same GGUF for both MTP and non-MTP runs, only varying the command-line arguments so that the MTP part of the model was left unused in the non-MTP runs)
- q8_0 quantization for the main KV cache (I don't want to compromise on quality too much)
- context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
- for MTP, I used --spec-draft-n-max 2 (I know that 3 might be slightly better in some cases, but decided to stick to this to make the results comparable)
- mmap enabled (it's the only way I can run this model without freezing my machine...)
I varied these parameters:
- MTP vs non-MTP (including/omitting MTP specific CLI parameters)
- ubatch size: 512, 1024, 1536, 2048
- draft model KV cache quantization: either q8_0 or q4_0 (always same for both K & V)
- --fit-target set to the lowest value (in steps of 64) that works without OOM errors
Here is an example of a full llama-server command (MTP 1 in the table below):
```
build/bin/llama-server \
    -m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
    --threads 8 \
    -ub 512 \
    --parallel 1 \
    --fit-target 448 \
    -c 65536 \
    -ctk q8_0 \
    -ctv q8_0 \
    -ctkd q8_0 \
    -ctvd q8_0 \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --temp 0.6 \
    --top-p 0.95 \
    --min-p 0.0 \
    --top-k 20 \
    --repeat-penalty 1.0 \
    --presence-penalty 0.0 \
    --spec-type draft-mtp \
    --spec-draft-n-max 2
```
I gave the model two tasks:
- MB: Run the mtp-bench.py script to benchmark MTP on a variety of tasks.
- S: Summarize a longer document (MTP PR 22673 from GitHub) into a few bullet points. This is a 13448-token prompt followed by 2000-3000 tokens of generation (a rough sketch of how such a run can be timed follows right below).
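For what it's worth, here is a rough sketch of one way to drive and time an S-style run against the running llama-server instance, using its native /completion endpoint and the timing block it returns. The port, the pr-22673.txt file name, and the exact timing field names are assumptions on my part, so check them against your build:
```bash
# Build the summarization prompt, send it to llama-server, and print the
# server-reported PP/TG speeds. Port, file name, and timing field names are
# assumptions; the sampling settings mirror the command above.
{
  printf 'Summarize the following document into a few bullet points:\n\n'
  cat pr-22673.txt
} | jq -cRs '{prompt: ., n_predict: 3000, temperature: 0.6, top_p: 0.95, min_p: 0.0, top_k: 20}' \
  | curl -s -H 'Content-Type: application/json' --data-binary @- http://127.0.0.1:8080/completion \
  | jq '.timings | {prompt_n, prompt_per_second, predicted_n, predicted_per_second}'
```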
Results
This table summarizes the outcome. ub = ubatch size, dKV = draft KV cache quant type, fitt = --fit-target value, acc% = draft acceptance rate, MB = mtp-bench task, S = summarization task. All PP and TG speeds are in tok/s.
| Setup | ub | dKV | fitt | MB TG | MB acc% | S PP | S TG | S acc% |
|---|---|---|---|---|---|---|---|---|
| No MTP 1 | 512 | - | 0 | 25.0 | - | 178 | 23.8 | - |
| No MTP 2 | 1024 | - | 0 | 23.1 | - | 292 | 22.5 | - |
| No MTP 3 | 1536 | - | 0 | 24.5 | - | 299 | 24.4 | - |
| No MTP 4 | 2048 | - | 0 | 23.0 | - | 436 | 26.1 | - |
| MTP 1 | 512 | q8_0 | 448 | 27.3 | 81.5 | 143 | 26.1 | 76.5 |
| MTP 2 | 1024 | q8_0 | 960 | 18.7 | 82.7 | 138 | 25.9 | 72.0 |
| MTP 3 | 512 | q4_0 | 448 | 26.4 | 81.5 | 139 | 25.3 | 73.4 |
| MTP 4 | 1024 | q4_0 | 960 | 25.4 | 82.7 | 198 | 23.7 | 73.7 |
I also tried higher ubatch values with MTP, but the results were so bad (TG 10-15 tok/s, probably due to running out of RAM and swapping) that I aborted those runs.
Verdict
- The baseline "No MTP 4" with ubatch=2048 is clearly the best non-MTP setup. It reached PP speeds over 400 tok/s and TG speeds of 23-26 tok/s.
- The "MTP 1" run with ubatch=512 reached the best TG speed (over 27 tok/s) in mtp-bench but was tied with "No MTP 4" on the summarization task TG. PP speeds were much lower than any non-MTP setups.
- Increasing the ubatch size with MTP improves PP speeds a bit, especially in the "MTP 4" setup, which also used q4_0 quantization for the draft KV cache. But this practically eliminated the TG benefit, and PP was still less than half of the best non-MTP run.
- In short: MTP is not worth it in this setting. Tiny increase in TG for some cases, but always a giant drop in PP speeds. If PP speeds for MTP are later improved in llama.cpp (this was listed as a known issue in the PR), this might change.
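A rough back-of-envelope on why the TG gain stays small (my own estimate, assuming per-token acceptance is roughly independent): with --spec-draft-n-max 2 and ~80% acceptance, each verification step yields on average about 1 + 0.8 + 0.8² ≈ 2.4 tokens. But the target model now has to process 3 tokens per step instead of 1, plus run the MTP head, and with most of the weights sitting in system RAM via mmap that extra work is nowhere near free, so the ~2.4x upper bound collapses to the small gains seen in the table.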
Observations
- I was surprised to see that using q4_0 quantization for the draft model KV cache had negligible impact on the draft acceptance rate. This saves a tiny bit of VRAM, so it might be a useful trick for very VRAM-constrained setups.
- There is a bit of unexplained variation between measurements, probably due to random chance, CPU/GPU thermal throttling, etc. Not too bad, but take the numbers with a grain of salt.
- VRAM is obviously very tight from the start. The MTP VRAM overhead easily pushes the system into a badly performing scenario.
- The --fit and --fit-target options don't seem to take into account the MTP overhead; you need to reserve some memory for MTP and this amount depends mainly on the ubatch size. Thus you have to set --fit-target manually if you want to squeeze the maximum performance out of your limited VRAM. In my case, setting fit-target to a number a bit less than the ubatch size seemed to work, but YMMV.
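To make the last point concrete, here is a rough sketch of that trial-and-error search as a script: walk --fit-target down in steps of 64 until the server no longer comes up, and keep the last value that loaded. The spare port, the ~90 s load timeout, and the /health-based check are my assumptions; the model flags mirror the MTP 1 command above.
```bash
# Probe for the lowest --fit-target (in steps of 64) that still lets the
# server load without OOM. Assumes a spare port and that /health turning
# healthy means the model loaded; adjust timings/flags for your setup.
for fitt in $(seq 1024 -64 64); do
  build/bin/llama-server -m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
    --threads 8 -ub 512 --parallel 1 -c 65536 \
    -ctk q8_0 -ctv q8_0 -ctkd q8_0 -ctvd q8_0 \
    --spec-type draft-mtp --spec-draft-n-max 2 \
    --fit-target "$fitt" --port 8099 >/dev/null 2>&1 &
  pid=$!
  ok=""
  for _ in $(seq 1 90); do                        # wait up to ~90 s for loading
    sleep 1
    curl -sf http://127.0.0.1:8099/health >/dev/null && { ok=1; break; }
    kill -0 "$pid" 2>/dev/null || break            # server died (likely OOM)
  done
  kill "$pid" 2>/dev/null; wait "$pid" 2>/dev/null
  if [ -n "$ok" ]; then
    echo "fit-target $fitt: loads fine"
  else
    echo "fit-target $fitt: failed; lowest working value is $((fitt + 64))"
    break
  fi
done
```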
Notes
This post was constructed from 100% organic ingredients. No AIs were harmed in the process.
My second post here. Happy to answer any questions.