r/LocalLLaMA · · 2 min read

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Hello everyone!

I want to share the result of my experiment to make Qwen3.6 27B Q4_K_M fits in to my RTX 5060 Ti 16 GB. Inspired by u/Due-Project-7507's work on Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF.

Using the same pure quantization method, I was able to create a Q4_K_M ggufs that fit completely in 16 GB VRAM.

Model URL: https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF

There are two versions Q4_K_M MTP (15.4 GB) and Q4_K_M non-MTP (15.1 GB).

You can download the GGUF and run with the latest llama.cpp version this way:

llama-server -m Qwen3.6-27B-MTP-Q4_K_M-pure.gguf -fitt 128 -c 65536 -fa on -np 1 -ctk q5_0 -ctv q5_0 -ctxcp 18 --no-mmap --mlock --no-warmup --chat-template-kwargs '{"preserve_thinking": true}' --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -ub 256 -b 1024 -ngl 99 --spec-type draft-mtp --spec-draft-n-max 2 

TOKEN SPEED

With the MTP version, I got 40 tok/s for tg, but slower pp, while the non-MTP version has higher pp and tg at 24 tok/s.

Version Prompt Processing Token Generation
MTP 195 tok/s 40 tok/s
Non MTP 715 tok/s 24 tok/s

MODEL SIZE

https://preview.redd.it/74ehd6vyvr2h1.png?width=5845&format=png&auto=webp&s=a66ba493ea1eb7fb61c999a47670c093700b9a97

MTP Version:

Model Size
huytd/Qwen3.6-27B-pure-GGUF Q4_K_M MTP 15.4 GB
froggeric/Qwen3.6-27B-MTP-GGUF Q4_K_M MTP 16.8 GB
unsloth/Qwen3.6-27B-MTP-GGUF Q4_K_M MTP 17.1 GB

Non MTP Version:

Model Size
huytd/Qwen3.6-27B-pure-GGUF Q4_K_M 15.1 GB
mradermacher/Qwen3.6-27B-GGUF Q4_K_M 16.5 GB
unsloth/Qwen3.6-27B-GGUF Q4_K_M 16.8 GB
bartowski/Qwen_Qwen3.6-27B-GGUF Q4_K_M 18 GB

PERPLEXITY DIFFERENCE

Currently I don't have the hardware that can run KLD benchmark, so just showing PPL difference here, but it should be good for you to get the trade-offs between quality and the size reduciton here.

https://preview.redd.it/lepgzq18wr2h1.png?width=4968&format=png&auto=webp&s=ece2b3f99f1406d0f46e3665e31b65a3b50fe7e7

Variant PPL Delta
BF16 MTP 7.5992 +/- 0.02890 base
This Q4_K_M MTP 7.7699 +/- 0.02972 +0.1707
Unsloth's Q4_K_M MTP 7.6545 +/- 0.02913 +0.0553
BF16 non-MTP 7.5992 +/- 0.02890 base
This Q4_K_M non-MTP 7.7043 +/- 0.02935 +0.1051
Unsloth's Q4_K_M non-MTP 7.6532 +/- 0.02912 +0.0540
submitted by /u/bobaburger
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA