r/LocalLLaMA · June 1, 2026 · 3 min read

unsloth vs bartowski MTP ggufs


I noticed that bartowski's MTP ggufs are bigger than unsloth's. I asked bartowski, and he said he used a Q8_0 quant for the MTP head. So I compared the decoding performance of the two.

/build/bin/llama-server -m ~/gguf/Qwen3.5-4B-Q4_0.gguf --host 0.0.0.0 --port 8080 -c 4096 -fa on --no-mmap -np 1 -ngl 99 --spec-type draft-mtp

Since I am more interested in running them on Snapdragon smartphones, I only tested Q4_0, IQ4_NL, Q4_1, MXFP4_MOE, and Q8_0. I am limited by the 24GB of VRAM on my 3090, so I couldn't test Q8_0 for the big models.

I used am17an's (the Qwen MTP PR author) mtp-bench.py for benching: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090#file-mtp-bench-py
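
For a quick single-request sanity check without the benchmark script, decode speed can also be read from the server's own timing report. The sketch below is not am17an's mtp-bench.py. It assumes llama-server's native `/completion` endpoint on the port from the command above, and that the response carries a `timings` object with `predicted_per_second`, which may differ between llama.cpp builds. The helper names and the example prompt are hypothetical.

```python
import json
import urllib.request

# Assumption: llama-server is reachable locally on the port used in the command above.
SERVER_URL = "http://127.0.0.1:8080/completion"


def extract_decode_tps(response: dict) -> float:
    """Return decode speed (tokens/s) from a /completion response body.

    Assumes the server reports it as timings.predicted_per_second.
    """
    timings = response.get("timings") or {}
    if "predicted_per_second" not in timings:
        raise KeyError("response has no timings.predicted_per_second")
    return float(timings["predicted_per_second"])


def measure_decode_tps(prompt: str, n_predict: int = 256, url: str = SERVER_URL) -> float:
    """Send one non-streaming completion request and return the reported decode t/s."""
    payload = json.dumps(
        {"prompt": prompt, "n_predict": n_predict, "temperature": 0.0}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=600) as reply:
        return extract_decode_tps(json.load(reply))


# Example (needs a running server):
#   print(measure_decode_tps("Write a quicksort in Python."))
```

A single request is noisy, so averaging several prompts (as a proper benchmark script does) gives a fairer number.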

Qwen3.5-4B	NoMTP VRAM	NoMTP t/s	MTP3 VRAM	MTP3 acc rate	MTP3 t/s
unsloth Q4_0	3530MiB	144.9t/s	4694MiB	0.5832	134.67t/s
bartowski Q4_0	3634MiB	143.05t/s	4796MiB	0.5804	132.97t/s
unsloth IQ4_NL	3588MiB	140.51t/s	4752MiB	0.5612	128.84t/s
bartowski IQ4_NL	3632MiB	141.31t/s	4794MiB	0.5748	130.4t/s
unsloth Q4_1	3728MiB	141.31t/s	4890MiB	0.6115	136.84t/s
bartowski Q4_1	3826MiB	138.88t/s	4988MiB	0.6188	131.96t/s
unsloth Q8_0	5370MiB	110.48t/s	6532MiB	0.5767	125.15t/s
bartowski Q8_0	5390MiB	111.66t/s	6552MiB	0.5903	124.24t/s

Qwen3.5-9B	NoMTP VRAM	NoMTP t/s	MTP3 VRAM	MTP3 acc rate	MTP3 t/s
unsloth Q4_0	5740MiB	105.32t/s	6934MiB	0.7076	122.55t/s
bartowski Q4_0	5922MiB	102.29t/s	7114MiB	0.6781	118.84t/s
unsloth IQ4_NL	5828MiB	104.7t/s	7022MiB	0.6576	116.73t/s
bartowski IQ4_NL	5916MiB	103.57t/s	7108MiB	0.6493	116.19t/s
unsloth Q4_1	6128MiB	101.57t/s	7320MiB	0.6657	115.58t/s
bartowski Q4_1	6300MiB	99.84t/s	7492MiB	0.6595	115.93t/s
unsloth Q8_0	9280MiB	74.42t/s	10472MiB	0.7013	104.59t/s
bartowski Q8_0	9308MiB	74.53t/s	10500MiB	0.693	105.23t/s

Qwen3.6-27B	NoMTP VRAM	NoMTP t/s	MTP3 VRAM	MTP3 acc rate	MTP3 t/s
unsloth Q4_0	15870MiB	41.43t/s	17376MiB	0.6829	63.46t/s
bartowski Q4_0	16352MiB	41.16t/s	17856MiB	0.7188	64.84t/s
unsloth Q4_1	17208MiB	39.68t/s	18712MiB	0.7011	66.03t/s
bartowski Q4_1	17682MiB	38.73t/s	19186MiB	0.6853	63.15t/s
unsloth IQ4_NL	16138MiB	40.76t/s	17644MiB	0.6939	60.85t/s
bartowski IQ4_NL	16328MiB	40.67t/s	17832MiB	0.7241	65.19t/s

Qwen3.6-35B-A3B	NoMTP VRAM	NoMTP t/s	MTP3 VRAM	MTP3 acc rate	MTP3 t/s
unsloth IQ4_NL	18122MiB	118.23t/s	19368MiB	0.6641	108.83t/s
bartowski IQ4_NL	20482MiB	127.58t/s	21726MiB	0.6881	112.53t/s

Observations:

For 4B, MTP is only faster for Q8_0, and even there you pay 21.6% more VRAM to gain 13.3% decoding speed (unsloth Q8_0).
For 9B, MTP is faster across the board. Speed gain seems to correlate with the acceptance rate, as expected.
For 27B, the speed gain is now very significant: 53.2% for only 9.5% more VRAM (unsloth Q4_0). This suggests that the larger the dense model, the more sense it makes to run MTP.
For 35B-A3B MoE, only IQ4_NL is available from both. Strangely, the bartowski gguf is 13% larger than the unsloth gguf but 8% faster without MTP. With MTP, both are slower than without it.
bartowski's ggufs are bigger than unsloth's in general, especially for the MoE model. Since there is no significant speed gain anyway, there is no reason to use bartowski's MTP ggufs if speed is the only concern.
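
The acceptance-rate observation for 9B can be checked against the table with a few lines of Python. This is a minimal sketch: restricting it to the six 4-bit rows is my own choice (the Q8_0 rows start from a much slower baseline, which would dominate the comparison), and six points are far too few for a firm conclusion.

```python
from math import sqrt

# (MTP3 acceptance rate, NoMTP t/s, MTP3 t/s) for the 4-bit Qwen3.5-9B rows above.
ROWS_9B_4BIT = {
    "unsloth Q4_0": (0.7076, 105.32, 122.55),
    "bartowski Q4_0": (0.6781, 102.29, 118.84),
    "unsloth IQ4_NL": (0.6576, 104.70, 116.73),
    "bartowski IQ4_NL": (0.6493, 103.57, 116.19),
    "unsloth Q4_1": (0.6657, 101.57, 115.58),
    "bartowski Q4_1": (0.6595, 99.84, 115.93),
}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def acceptance_vs_gain(rows):
    """Correlate acceptance rate with the MTP speedup ratio (MTP t/s / NoMTP t/s)."""
    acc = [a for a, _, _ in rows.values()]
    gain = [mtp / base for _, base, mtp in rows.values()]
    return pearson(acc, gain)


print(f"r = {acceptance_vs_gain(ROWS_9B_4BIT):.2f}")
```

On these rows the coefficient comes out positive, which is consistent with the observation but not proof of it.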

Please note that I only measured speed, not perplexity or intelligence, so there might be advantages in these areas for the bartowski ggufs. This test is for a single user; I presume MTP will bring more benefits for multiple users.

Overall, while MTP is nice to have, it can make decoding slower, as it did here for the 4B 4-bit quants and the 35B-A3B MoE. It is better to conduct your own tests to see if the extra VRAM is worth the decoding speedup (if there is any at all).
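
The VRAM-versus-speed trade-off can be computed from any pair of NoMTP/MTP measurements. A minimal sketch, with illustrative function names, using rows from the tables above:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


def mtp_tradeoff(no_mtp_vram_mib, no_mtp_tps, mtp_vram_mib, mtp_tps):
    """Return the extra VRAM and the decode speed change caused by enabling MTP."""
    return {
        "vram_pct": pct_change(no_mtp_vram_mib, mtp_vram_mib),
        "speed_pct": pct_change(no_mtp_tps, mtp_tps),
    }


# Rows copied from the tables: (NoMTP VRAM MiB, NoMTP t/s, MTP3 VRAM MiB, MTP3 t/s)
examples = {
    "4B unsloth Q4_0": (3530, 144.90, 4694, 134.67),
    "4B unsloth Q8_0": (5370, 110.48, 6532, 125.15),
    "27B unsloth Q4_0": (15870, 41.43, 17376, 63.46),
}

for name, row in examples.items():
    result = mtp_tradeoff(*row)
    print(f"{name}: {result['vram_pct']:+.1f}% VRAM, {result['speed_pct']:+.1f}% t/s")
```

A negative speed figure means MTP slowed decoding down on that model, in which case the extra VRAM buys nothing.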

Does anyone know why the size difference is particularly big for the MoE ggufs?

submitted by /u/Ok_Warning2146