unsloth vs bartowski MTP ggufs
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I noticed that bartowski's MTP ggufs are bigger than unsloth's. I asked bartowski, and he said he uses a Q8_0 quant for the MTP head. So I compared the decoding performance of the two.
/build/bin/llama-server -m ~/gguf/Qwen3.5-4B-Q4_0.gguf --host 0.0.0.0 --port 8080 -c 4096 -fa on --no-mmap -np 1 -ngl 99 --spec-type draft-mtp
I am mostly interested in running these on Snapdragon smartphones, so I only tested Q4_0, IQ4_NL, Q4_1, MXFP4_MOE, and Q8_0. I am limited by the 24GB of VRAM on my 3090, so I couldn't test Q8_0 for the big models.
I used am17an's (the Qwen MTP PR author) mtp-bench.py for benching: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090#file-mtp-bench-py
| Qwen3.5-4B | NoMTP VRAM | NoMTP t/s | MTP3 VRAM | MTP3 accept rate | MTP3 t/s |
|---|---|---|---|---|---|
| unsloth Q4_0 | 3530MiB | 144.9t/s | 4694MiB | 0.5832 | 134.67t/s |
| bartowski Q4_0 | 3634MiB | 143.05t/s | 4796MiB | 0.5804 | 132.97t/s |
| unsloth IQ4_NL | 3588MiB | 140.51t/s | 4752MiB | 0.5612 | 128.84t/s |
| bartowski IQ4_NL | 3632MiB | 141.31t/s | 4794MiB | 0.5748 | 130.4t/s |
| unsloth Q4_1 | 3728MiB | 141.31t/s | 4890MiB | 0.6115 | 136.84t/s |
| bartowski Q4_1 | 3826MiB | 138.88t/s | 4988MiB | 0.6188 | 131.96t/s |
| unsloth Q8_0 | 5370MiB | 110.48t/s | 6532MiB | 0.5767 | 125.15t/s |
| bartowski Q8_0 | 5390MiB | 111.66t/s | 6552MiB | 0.5903 | 124.24t/s |
| Qwen3.5-9B | NoMTP VRAM | NoMTP t/s | MTP3 VRAM | MTP3 accept rate | MTP3 t/s |
|---|---|---|---|---|---|
| unsloth Q4_0 | 5740MiB | 105.32t/s | 6934MiB | 0.7076 | 122.55t/s |
| bartowski Q4_0 | 5922MiB | 102.29t/s | 7114MiB | 0.6781 | 118.84t/s |
| unsloth IQ4_NL | 5828MiB | 104.7t/s | 7022MiB | 0.6576 | 116.73t/s |
| bartowski IQ4_NL | 5916MiB | 103.57t/s | 7108MiB | 0.6493 | 116.19t/s |
| unsloth Q4_1 | 6128MiB | 101.57t/s | 7320MiB | 0.6657 | 115.58t/s |
| bartowski Q4_1 | 6300MiB | 99.84t/s | 7492MiB | 0.6595 | 115.93t/s |
| unsloth Q8_0 | 9280MiB | 74.42t/s | 10472MiB | 0.7013 | 104.59t/s |
| bartowski Q8_0 | 9308MiB | 74.53t/s | 10500MiB | 0.693 | 105.23t/s |
| Qwen3.6-27B | NoMTP VRAM | NoMTP t/s | MTP3 VRAM | MTP3 accept rate | MTP3 t/s |
|---|---|---|---|---|---|
| unsloth Q4_0 | 15870MiB | 41.43t/s | 17376MiB | 0.6829 | 63.46t/s |
| bartowski Q4_0 | 16352MiB | 41.16t/s | 17856MiB | 0.7188 | 64.84t/s |
| unsloth Q4_1 | 17208MiB | 39.68t/s | 18712MiB | 0.7011 | 66.03t/s |
| bartowski Q4_1 | 17682MiB | 38.73t/s | 19186MiB | 0.6853 | 63.15t/s |
| unsloth IQ4_NL | 16138MiB | 40.76t/s | 17644MiB | 0.6939 | 60.85t/s |
| bartowski IQ4_NL | 16328MiB | 40.67t/s | 17832MiB | 0.7241 | 65.19t/s |
| Qwen3.6-35B-A3B | NoMTP VRAM | NoMTP t/s | MTP3 VRAM | MTP3 accept rate | MTP3 t/s |
|---|---|---|---|---|---|
| unsloth IQ4_NL | 18122MiB | 118.23t/s | 19368MiB | 0.6641 | 108.83t/s |
| bartowski IQ4_NL | 20482MiB | 127.58t/s | 21726MiB | 0.6881 | 112.53t/s |
Observations:
- For 4B, MTP is faster only for Q8_0, and even there you pay 21.6% more VRAM to gain 13.3% decoding speed (unsloth Q8_0).
- For 9B, MTP is faster across the board. The speed gain seems to correlate with the acceptance rate, as expected.
- For 27B, the speed gain is now very significant: 53.2% for only 9.5% more VRAM (unsloth Q4_0). This suggests that the larger the dense model, the more sense it makes to run MTP.
- For the 35B-A3B MoE, only IQ4_NL is available from both. Strangely, the bartowski gguf is 13% larger than the unsloth gguf (by VRAM use) but 8% faster without MTP. With MTP enabled, both are slower than without it.
- bartowski's ggufs are bigger than unsloth's in general, especially for the MoE model. Since the larger files show no consistent speed gain, there is no reason to use bartowski's MTP ggufs if speed is the only concern.
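For anyone who wants to redo the percentage math on their own runs, here is a minimal sketch of how the VRAM overhead and speed change can be derived from the table columns. The rows are copied from the tables above; the helper names `pct_change` and `summarize` are made up for illustration and are not part of mtp-bench.py.

```python
# Minimal sketch: derive "extra VRAM %" and "speed change %" from table rows.
# pct_change and summarize are hypothetical helper names, not part of mtp-bench.py.

def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# (label, NoMTP VRAM MiB, NoMTP t/s, MTP3 VRAM MiB, MTP3 t/s), copied from the tables
ROWS = [
    ("4B unsloth Q4_0", 3530, 144.90, 4694, 134.67),
    ("4B unsloth Q8_0", 5370, 110.48, 6532, 125.15),
    ("9B unsloth Q4_0", 5740, 105.32, 6934, 122.55),
    ("27B unsloth Q4_0", 15870, 41.43, 17376, 63.46),
    ("35B-A3B unsloth IQ4_NL", 18122, 118.23, 19368, 108.83),
]

def summarize(rows):
    """Map each label to (VRAM overhead %, speed change %), rounded to 0.1."""
    return {
        label: (
            round(pct_change(vram_off, vram_on), 1),
            round(pct_change(tps_off, tps_on), 1),
        )
        for label, vram_off, tps_off, vram_on, tps_on in rows
    }

if __name__ == "__main__":
    for label, (vram, speed) in summarize(ROWS).items():
        print(f"{label:24s} VRAM {vram:+6.1f}%   speed {speed:+6.1f}%")
```

A negative speed value means MTP made decoding slower for that row. Swap in your own measurements to see whether the trade is worth it on your hardware.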
Please note that I only measured speed, not perplexity or intelligence, so the bartowski ggufs might have advantages in those areas. This test is single-user; I presume MTP will bring more benefit with multiple users.
Overall, while MTP is nice to have, it often makes things worse: here it slowed decoding for the 4B 4-bit quants and for the 35B-A3B MoE. It is better to run your own tests to see whether the extra VRAM is worth the decoding speedup (if there is any at all).
Does anyone know why the size difference is particularly big for the MoE ggufs?