Hello everyone!
I want to share the result of my experiment to make Qwen3.6 27B Q4_K_M fits in to my RTX 5060 Ti 16 GB. Inspired by u/Due-Project-7507's work on Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF.
Using the same pure quantization method, I was able to create a Q4_K_M ggufs that fit completely in 16 GB VRAM.
Model URL: https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF
There are two versions Q4_K_M MTP (15.4 GB) and Q4_K_M non-MTP (15.1 GB).
You can download the GGUF and run with the latest llama.cpp version this way:
llama-server -m Qwen3.6-27B-MTP-Q4_K_M-pure.gguf -fitt 128 -c 65536 -fa on -np 1 -ctk q5_0 -ctv q5_0 -ctxcp 18 --no-mmap --mlock --no-warmup --chat-template-kwargs '{"preserve_thinking": true}' --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -ub 256 -b 1024 -ngl 99 --spec-type draft-mtp --spec-draft-n-max 2
TOKEN SPEED
With the MTP version, I got 40 tok/s for tg, but slower pp, while the non-MTP version has higher pp and tg at 24 tok/s.
| Version | Prompt Processing | Token Generation |
| MTP | 195 tok/s | 40 tok/s |
| Non MTP | 715 tok/s | 24 tok/s |
MODEL SIZE
https://preview.redd.it/74ehd6vyvr2h1.png?width=5845&format=png&auto=webp&s=a66ba493ea1eb7fb61c999a47670c093700b9a97
MTP Version:
| Model | Size |
| huytd/Qwen3.6-27B-pure-GGUF Q4_K_M MTP | 15.4 GB |
| froggeric/Qwen3.6-27B-MTP-GGUF Q4_K_M MTP | 16.8 GB |
| unsloth/Qwen3.6-27B-MTP-GGUF Q4_K_M MTP | 17.1 GB |
Non MTP Version:
| Model | Size |
| huytd/Qwen3.6-27B-pure-GGUF Q4_K_M | 15.1 GB |
| mradermacher/Qwen3.6-27B-GGUF Q4_K_M | 16.5 GB |
| unsloth/Qwen3.6-27B-GGUF Q4_K_M | 16.8 GB |
| bartowski/Qwen_Qwen3.6-27B-GGUF Q4_K_M | 18 GB |
PERPLEXITY DIFFERENCE
Currently I don't have the hardware that can run KLD benchmark, so just showing PPL difference here, but it should be good for you to get the trade-offs between quality and the size reduciton here.
https://preview.redd.it/lepgzq18wr2h1.png?width=4968&format=png&auto=webp&s=ece2b3f99f1406d0f46e3665e31b65a3b50fe7e7
| Variant | PPL | Delta |
| BF16 MTP | 7.5992 +/- 0.02890 | base |
| This Q4_K_M MTP | 7.7699 +/- 0.02972 | +0.1707 |
| Unsloth's Q4_K_M MTP | 7.6545 +/- 0.02913 | +0.0553 |
| BF16 non-MTP | 7.5992 +/- 0.02890 | base |
| This Q4_K_M non-MTP | 7.7043 +/- 0.02935 | +0.1051 |
| Unsloth's Q4_K_M non-MTP | 7.6532 +/- 0.02912 | +0.0540 |
submitted by
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.