r/LocalLLaMA · · 3 min read

unsloth vs bartowski MTP ggufs

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I noticed that bartowski's MTP ggufs are bigger than unsloth's. I asked bartowski and he said he used a Q8_0 quant for the MTP head. So I compared the decoding performance of the two, launching llama-server like this:

/build/bin/llama-server -m ~/gguf/Qwen3.5-4B-Q4_0.gguf --host 0.0.0.0 --port 8080 -c 4096 -fa on --no-mmap -np 1 -ngl 99 --spec-type draft-mtp

Since I am mostly interested in running these on Snapdragon smartphones, I only tested Q4_0, IQ4_NL, Q4_1, MXFP4_MOE, and Q8_0. My 3090 has only 24GB of VRAM, so I couldn't test Q8_0 for the larger models.

I used am17an's (the Qwen MTP PR author) mtp-bench.py for benching: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090#file-mtp-bench-py
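For anyone who wants a quick sanity check without the gist, here is a minimal sketch of measuring decode speed against the server started above. It is not am17an's script. It assumes the llama-server `/completion` endpoint returns a `timings` object with `predicted_n` and `predicted_ms` fields; `decode_tps` and `bench_once` are names made up for this example.

```python
import json
import urllib.request


def decode_tps(timings: dict) -> float:
    """Decode speed in tokens/s from generated-token count and elapsed ms."""
    return timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)


def bench_once(base_url: str, prompt: str, n_predict: int = 256) -> float:
    """Send one completion request and return the measured decode speed.

    Assumes the server reports timings in the response body; check the
    field names against your llama.cpp build before relying on this.
    """
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return decode_tps(body["timings"])
```

Calling `bench_once("http://127.0.0.1:8080", "Explain speculative decoding.")` a few times and averaging gives a rough t/s figure. Run it once with and once without `--spec-type draft-mtp` to compare.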

| Qwen3.5-4B | NoMTP VRAM (MiB) | NoMTP t/s | MTP3 VRAM (MiB) | MTP3 accept rate | MTP3 t/s |
|---|---|---|---|---|---|
| unsloth Q4_0 | 3530 | 144.9 | 4694 | 0.5832 | 134.67 |
| bartowski Q4_0 | 3634 | 143.05 | 4796 | 0.5804 | 132.97 |
| unsloth IQ4_NL | 3588 | 140.51 | 4752 | 0.5612 | 128.84 |
| bartowski IQ4_NL | 3632 | 141.31 | 4794 | 0.5748 | 130.4 |
| unsloth Q4_1 | 3728 | 141.31 | 4890 | 0.6115 | 136.84 |
| bartowski Q4_1 | 3826 | 138.88 | 4988 | 0.6188 | 131.96 |
| unsloth Q8_0 | 5370 | 110.48 | 6532 | 0.5767 | 125.15 |
| bartowski Q8_0 | 5390 | 111.66 | 6552 | 0.5903 | 124.24 |

| Qwen3.5-9B | NoMTP VRAM (MiB) | NoMTP t/s | MTP3 VRAM (MiB) | MTP3 accept rate | MTP3 t/s |
|---|---|---|---|---|---|
| unsloth Q4_0 | 5740 | 105.32 | 6934 | 0.7076 | 122.55 |
| bartowski Q4_0 | 5922 | 102.29 | 7114 | 0.6781 | 118.84 |
| unsloth IQ4_NL | 5828 | 104.7 | 7022 | 0.6576 | 116.73 |
| bartowski IQ4_NL | 5916 | 103.57 | 7108 | 0.6493 | 116.19 |
| unsloth Q4_1 | 6128 | 101.57 | 7320 | 0.6657 | 115.58 |
| bartowski Q4_1 | 6300 | 99.84 | 7492 | 0.6595 | 115.93 |
| unsloth Q8_0 | 9280 | 74.42 | 10472 | 0.7013 | 104.59 |
| bartowski Q8_0 | 9308 | 74.53 | 10500 | 0.693 | 105.23 |

| Qwen3.6-27B | NoMTP VRAM (MiB) | NoMTP t/s | MTP3 VRAM (MiB) | MTP3 accept rate | MTP3 t/s |
|---|---|---|---|---|---|
| unsloth Q4_0 | 15870 | 41.43 | 17376 | 0.6829 | 63.46 |
| bartowski Q4_0 | 16352 | 41.16 | 17856 | 0.7188 | 64.84 |
| unsloth Q4_1 | 17208 | 39.68 | 18712 | 0.7011 | 66.03 |
| bartowski Q4_1 | 17682 | 38.73 | 19186 | 0.6853 | 63.15 |
| unsloth IQ4_NL | 16138 | 40.76 | 17644 | 0.6939 | 60.85 |
| bartowski IQ4_NL | 16328 | 40.67 | 17832 | 0.7241 | 65.19 |

| Qwen3.6-35B-A3B | NoMTP VRAM (MiB) | NoMTP t/s | MTP3 VRAM (MiB) | MTP3 accept rate | MTP3 t/s |
|---|---|---|---|---|---|
| unsloth IQ4_NL | 18122 | 118.23 | 19368 | 0.6641 | 108.83 |
| bartowski IQ4_NL | 20482 | 127.58 | 21726 | 0.6881 | 112.53 |

Observations:

  1. For 4B, MTP is only faster for Q8_0, and even there you pay 21.6% more VRAM to gain 13.3% decoding speed (unsloth Q8_0). For all the 4-bit quants, MTP is slower than no MTP.
  2. For 9B, MTP is faster across the board. The speed gain seems to correlate with the acceptance rate, as expected.
  3. For 27B, the speed gain is now very significant: 53.2% for only 9.5% more VRAM (unsloth Q4_0). This suggests that the larger the dense model, the more sense it makes to run MTP.
  4. For the 35B-A3B MoE, only IQ4_NL is available from both. Strangely, without MTP the bartowski gguf uses 13% more VRAM than the unsloth gguf but is 8% faster. With MTP, both are slower than without it (by about 8% for unsloth and 12% for bartowski).
  5. bartowski's ggufs are generally bigger than unsloth's, especially for the MoE model. Since they show no significant speed gain over unsloth's, there is no reason to use bartowski's MTP ggufs if speed is the only concern.
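The percentages in the observations come straight from the table rows. A small sketch that reproduces them is below; the `Row` class and the helper names are made up for this example, and the numbers are copied from the tables above.

```python
from dataclasses import dataclass


@dataclass
class Row:
    label: str
    base_vram_mib: int   # NoMTP VRAM
    base_tps: float      # NoMTP t/s
    mtp_vram_mib: int    # MTP3 VRAM
    mtp_tps: float       # MTP3 t/s


def pct_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def summarize(row: Row) -> tuple[float, float]:
    """Return (extra VRAM %, decode speed change %) when enabling MTP."""
    vram = round(pct_change(row.mtp_vram_mib, row.base_vram_mib), 1)
    speed = round(pct_change(row.mtp_tps, row.base_tps), 1)
    return vram, speed


ROWS = [
    Row("4B unsloth Q8_0", 5370, 110.48, 6532, 125.15),
    Row("27B unsloth Q4_0", 15870, 41.43, 17376, 63.46),
    Row("35B-A3B unsloth IQ4_NL", 18122, 118.23, 19368, 108.83),
    Row("35B-A3B bartowski IQ4_NL", 20482, 127.58, 21726, 112.53),
]

for row in ROWS:
    vram, speed = summarize(row)
    print(f"{row.label}: VRAM {vram:+.1f}%, decode speed {speed:+.1f}%")
```

Swap in your own measurements to see whether the VRAM cost buys a speedup on your hardware.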

Please note that I only measured speed, not perplexity or intelligence, so the bartowski ggufs may have advantages in those areas. This test is for a single user; I presume MTP brings more benefit with multiple users.

Overall, while MTP is nice to have, it often makes decoding slower, as with the 4B 4-bit quants and the 35B-A3B MoE here. It is better to run your own tests to see whether the extra VRAM is worth the decoding speedup (if there is any at all).

Does anyone know why the size difference is particularly big for the MoE ggufs?

submitted by /u/Ok_Warning2146