r/LocalLLaMA · 5 min read

MTP for Qwen3.6-35B-A3B on 6GB VRAM laptop: not worth it


I have an Asus gaming laptop from 2021 that I bought used for 500€ last year. I wanted to see whether the recently merged MTP support in llama.cpp is worth using on such a VRAM-constrained device with the Qwen3.6-35B-A3B model, so I ran some experiments to measure performance with and without MTP.

TL;DR: It's not worth it. The prompt processing is so much slower with MTP that it outweighs the minimal gains in TG speeds. However, I did discover a useful VRAM saving trick: using q4_0 quantization for the draft KV cache works just as well as q8_0 and saves a small bit of VRAM.

Hardware

  • Asus ROG Zephyrus G14 laptop, 2021 model
  • AMD Ryzen 7 5800HS with Radeon Graphics (8 CPU cores / 16 threads)
  • NVIDIA RTX 3060 Laptop GPU, 6GB VRAM
  • 24GB RAM (DDR4 3200 MT/s), 1TB SSD

Software

  • Linux Mint 22.2 (based on Ubuntu 24.04) with the Cinnamon desktop running on the Radeon iGPU (thus the 3060 was dedicated to llama.cpp only)
  • llama.cpp version: 9198 (a6d6183db) built from current master branch with GNU 13.3.0 for Linux x86_64
  • CUDA 12.0 installed from Ubuntu repositories

Test setup

I fixed the following parameters for all the experiments:

  • Unsloth Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL model (about the largest this system can run; I used the same model file for both the MTP and non-MTP runs and only varied the command-line arguments, so the MTP part of the model was simply unused in the non-MTP runs)
  • q8_0 quantization for the main KV cache (I don't want to compromise on quality too much)
  • context size 65536 (enough to do agentic coding on e.g. Pi or Dirac, or run Hermes Agent)
  • for MTP, I used --spec-draft-n-max 2 (I know that 3 might be slightly better in some cases, but decided to stick to this to make the results comparable)
  • mmap enabled (it's the only way I can run this model without freezing my machine...)

I varied these parameters:

  • MTP vs non-MTP (including/omitting MTP specific CLI parameters)
  • ubatch size: 512, 1024, 1536, 2048
  • draft model KV cache quantization: either q8_0 or q4_0 (always same for both K & V)
  • --fit-target set to the lowest value (in steps of 64) that works without OOM errors
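The last bullet (walking --fit-target down in steps of 64 until just before OOM) can be automated with a simple probe loop. This is a hypothetical sketch: `probe` stands in for actually launching llama-server with that value and checking whether it survives without an OOM error.

```python
def lowest_fit_target(probe, start=2048, step=64):
    """Walk --fit-target down in steps of `step` and return the lowest
    value for which `probe(value)` still succeeds (no OOM).
    `probe` is a stand-in for launching llama-server and checking it."""
    best = None
    value = start
    while value >= 0:
        if probe(value):       # server started without OOM
            best = value
            value -= step      # try an even lower reservation
        else:
            break              # OOM: the previous value was the minimum
    return best

# Fake probe for illustration: pretend anything below 448 OOMs.
print(lowest_fit_target(lambda v: v >= 448))  # → 448
```

In practice each probe means restarting the server and sending one full-context request, so the search takes a few minutes, but it only has to be done once per configuration.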

Here is an example of a full llama-server command (MTP 1 in the table below):

build/bin/llama-server \
    -m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
    --threads 8 \
    -ub 512 \
    --parallel 1 \
    --fit-target 448 \
    -c 65536 \
    -ctk q8_0 \
    -ctv q8_0 \
    -ctkd q8_0 \
    -ctvd q8_0 \
    --chat-template-kwargs '{"preserve_thinking": true}' \
    --temp 0.6 \
    --top-p 0.95 \
    --min-p 0.0 \
    --top-k 20 \
    --repeat-penalty 1.0 \
    --presence-penalty 0.0 \
    --spec-type draft-mtp \
    --spec-draft-n-max 2

I gave the model two tasks:

  1. MB: Run the mtp-bench.py script to benchmark MTP on various different tasks.
  2. S: Summarize a longer document (MTP PR 22673 from GitHub) into a few bullet points. This is a 13448-token prompt followed by 2000-3000 tokens of generation.

Results

This table summarizes the outcome. ub = ubatch size, dKV = draft KV quant type, fitt = fit-target value, acc% = acceptance rate; PP and TG speeds are in tok/s.

Setup      ub    dKV   fitt  MB TG  MB acc%  S PP  S TG  S acc%
No MTP 1   512   -     0     25.0   -        178   23.8  -
No MTP 2   1024  -     0     23.1   -        292   22.5  -
No MTP 3   1536  -     0     24.5   -        299   24.4  -
No MTP 4   2048  -     0     23.0   -        436   26.1  -
MTP 1      512   q8_0  448   27.3   81.5     143   26.1  76.5
MTP 2      1024  q8_0  960   18.7   82.7     138   25.9  72.0
MTP 3      512   q4_0  448   26.4   81.5     139   25.3  73.4
MTP 4      1024  q4_0  960   25.4   82.7     198   23.7  73.7
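As a rough sanity check on those acceptance rates: under a simplified model where each drafted token is accepted independently with probability p (treating the reported acc% as a per-token probability, which is an assumption, not llama.cpp's actual accounting), a draft length of n yields about 1 + p + ... + p^n emitted tokens per target-model forward pass:

```python
def expected_tokens_per_step(p, n):
    """Expected tokens emitted per target-model forward pass when n
    tokens are drafted and each is accepted independently with
    probability p (geometric-series model of speculative decoding)."""
    return sum(p**k for k in range(n + 1))  # 1 + p + ... + p^n

# --spec-draft-n-max 2 with the ~80% acceptance seen in the table:
print(round(expected_tokens_per_step(0.80, 2), 2))  # → 2.44
```

So in the best case ~80% acceptance with n=2 could cut target-model passes by up to ~2.4x; that the observed TG gain is only ~1.1x suggests the draft-model overhead eats most of the theoretical benefit on this hardware.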

I also tried higher ubatch values with MTP, but the results were so bad (TG 10-15 tok/s, probably due to running out of RAM and swapping) that I aborted those runs.

Verdict

  • The baseline "No MTP 4" with ubatch=2048 is clearly the best non-MTP setup. It reached PP speeds over 400 tok/s and TG speeds of 23-26 tok/s.
  • The "MTP 1" run with ubatch=512 reached the best TG speed (over 27 tok/s) in mtp-bench but was tied with "No MTP 4" on the summarization task TG. Its PP speeds were much lower than in any non-MTP setup.
  • Increasing the ubatch size with MTP can improve PP speeds a bit, especially in the "MTP 4" setup, which also used q4_0 quantization for the draft KV cache. But this practically eliminated the TG benefit while PP speeds still stayed at less than half the non-MTP baseline.
  • In short: MTP is not worth it in this setting. Tiny increase in TG for some cases, but always a giant drop in PP speeds. If PP speeds for MTP are later improved in llama.cpp (this was listed as a known issue in the PR), this might change.

Observations

  • I was surprised to see that using q4_0 quantization for the draft model KV cache had negligible impact on draft model accuracy. This saves a tiny bit of VRAM, so might be a useful trick for very VRAM constrained setups.
  • There is a bit of unexplained variation between measurements, probably due to random chance, CPU/GPU temperature throttling, etc. Not too bad, but take the numbers with a grain of salt.
  • VRAM is obviously very tight from the start. The MTP VRAM overhead easily pushes the system into a badly performing scenario.
  • The --fit and --fit-target options don't seem to account for the MTP overhead: you need to reserve some memory for MTP, and the amount depends mainly on the ubatch size. So to squeeze maximum performance out of limited VRAM, you have to set --fit-target manually. In my case, a value a bit below the ubatch size worked, but YMMV.
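On the q4_0 draft-KV point above: assuming the standard GGUF block layouts for these quant types (a sketch of the expected ratio, not measured numbers from my runs), the saving per cached value works out like this:

```python
# GGUF block layouts: both q8_0 and q4_0 pack 32 values per block with a
# 2-byte fp16 scale; q8_0 then stores 1 byte per value, q4_0 half a byte.
Q8_0_BYTES_PER_BLOCK = 2 + 32       # 34 bytes per 32 values
Q4_0_BYTES_PER_BLOCK = 2 + 32 // 2  # 18 bytes per 32 values

saving = 1 - Q4_0_BYTES_PER_BLOCK / Q8_0_BYTES_PER_BLOCK
print(f"q4_0 KV cache is {saving:.0%} smaller than q8_0")  # → 47% smaller
```

Since only the small draft-model KV cache is affected, 47% of a small number is still a small number in absolute MB, which matches the "tiny bit of VRAM" I observed.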

Notes

This post was constructed from 100% organic ingredients. No AIs were harmed in the process.

My second post here. Happy to answer any questions.

submitted by /u/OsmanthusBloom
