r/LocalLLaMA

Try ik_llama.cpp with MTP if you have limited VRAM. You will be pleasantly surprised!


I had been getting great MTP (multi-token prediction) performance with llama.cpp on my RTX 4070 Super 12GB (75-80 tok/s) until the MTP PR was actually merged. After that, performance tanked to 65-70 tok/s, barely above non-MTP. So I decided to try ik_llama.cpp, since it also supports MTP. I did not expect such a huge speed difference!

Here are my latest mtp-bench.py results with byteshape's Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf:

    ❯ ./mtp-bench.py
    code_python       pred= 192 draft= 135 acc= 122 rate=0.904 tok/s=105.1
    code_cpp          pred= 192 draft= 136 acc= 120 rate=0.882 tok/s=110.3
    explain_concept   pred= 192 draft= 133 acc= 116 rate=0.872 tok/s=109.0
    summarize         pred=  56 draft=  38 acc=  37 rate=0.974 tok/s=122.3
    qa_factual        pred= 192 draft= 141 acc= 127 rate=0.901 tok/s=116.0
    translation       pred= 192 draft= 143 acc= 113 rate=0.790 tok/s=104.1
    creative_short    pred= 192 draft= 133 acc= 118 rate=0.887 tok/s=109.4
    stepwise_math     pred= 192 draft= 140 acc= 125 rate=0.893 tok/s=114.6
    long_code_review  pred= 192 draft= 128 acc= 108 rate=0.844 tok/s=101.4

    Aggregate:
    {
      "n_requests": 9,
      "total_predicted": 1592,
      "total_draft": 1127,
      "total_draft_accepted": 986,
      "aggregate_accept_rate": 0.8749,
      "wall_s_total": 16.64
    }

That's an average of 110.24 tok/s across the nine prompts!
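For anyone who wants to check the arithmetic, here is a small sketch that recomputes the summary numbers from the per-prompt lines. The 110.24 figure is the unweighted mean of the nine per-prompt tok/s values. Dividing total predicted tokens by wall_s_total gives a lower number, presumably because wall time also covers prompt processing and request overhead; that explanation is an assumption, not something the benchmark output states. The variable names are made up for illustration.

```python
# Per-prompt figures copied from the mtp-bench.py output in the post:
# (name, predicted, drafted, accepted, tok/s)
results = [
    ("code_python",      192, 135, 122, 105.1),
    ("code_cpp",         192, 136, 120, 110.3),
    ("explain_concept",  192, 133, 116, 109.0),
    ("summarize",         56,  38,  37, 122.3),
    ("qa_factual",       192, 141, 127, 116.0),
    ("translation",      192, 143, 113, 104.1),
    ("creative_short",   192, 133, 118, 109.4),
    ("stepwise_math",    192, 140, 125, 114.6),
    ("long_code_review", 192, 128, 108, 101.4),
]
wall_s_total = 16.64  # from the "Aggregate" block

# Unweighted mean of the per-prompt generation speeds.
mean_tps = sum(r[4] for r in results) / len(results)

# Totals, which should match the "Aggregate" block.
total_predicted = sum(r[1] for r in results)
total_draft = sum(r[2] for r in results)
total_accepted = sum(r[3] for r in results)

# Share of drafted tokens that were accepted.
accept_rate = total_accepted / total_draft

# Wall-clock throughput (assumed to include prompt processing and overhead).
wall_tps = total_predicted / wall_s_total

print(f"mean tok/s:       {mean_tps:.2f}")     # 110.24
print(f"accept rate:      {accept_rate:.4f}")  # 0.8749
print(f"wall-clock tok/s: {wall_tps:.2f}")     # 95.67
```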

If you want to get similar results on a 12GB RTX GPU, make sure you use the following ik_llama.cpp launch parameters:

    llama-server \
      -m Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf \
      --fit \
      --fit-margin 1664 \
      --ctx-size 131072 \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --cache-type-k-draft q8_0 \
      --cache-type-v-draft q8_0 \
      --multi-token-prediction \
      --draft-p-min 0.75 \
      --draft-max 3 \
      --no-mmap \
      --mlock \
      --threads 8 \
      --temp 0.0

I also want to mention that I'm on CachyOS running this GPU as a secondary GPU, with the monitor plugged into the iGPU, so I can use 100% of the available VRAM. If you get an OOM error when loading the model, increase --fit-margin to 1792 or even 2048.
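If you end up trying several --fit-margin values, a tiny helper can print the full command for each one so you can paste it into a shell. This is only a sketch: build_command is a hypothetical helper name, the flags are copied verbatim from the launch parameters in the post, and the script does not start llama-server or detect OOM errors itself.

```python
import shlex

def build_command(fit_margin=1664,
                  model="Qwen3.6-35B-A3B-IQ4_XS-4.19bpw.gguf"):
    """Return the launch command from the post as an argument list.

    Hypothetical helper: only --fit-margin and the model path vary.
    """
    return [
        "llama-server",
        "-m", model,
        "--fit",
        "--fit-margin", str(fit_margin),
        "--ctx-size", "131072",
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q8_0",
        "--cache-type-k-draft", "q8_0",
        "--cache-type-v-draft", "q8_0",
        "--multi-token-prediction",
        "--draft-p-min", "0.75",
        "--draft-max", "3",
        "--no-mmap",
        "--mlock",
        "--threads", "8",
        "--temp", "0.0",
    ]

# Margins suggested in the post, smallest first.
for margin in (1664, 1792, 2048):
    print(shlex.join(build_command(margin)))
    print()
```

Start with the smallest margin and only move up if the model fails to load, since a larger margin leaves more VRAM unused.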

Cheers!

submitted by /u/janvitos