r/LocalLLaMA · May 22, 2026 · 2 min read

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Hello everyone!

I want to share the result of my experiment to make Qwen3.6 27B Q4_K_M fits in to my RTX 5060 Ti 16 GB. Inspired by u/Due-Project-7507's work on Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF.

Using the same pure quantization method, I was able to create a Q4_K_M ggufs that fit completely in 16 GB VRAM.

Model URL: https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF

There are two versions Q4_K_M MTP (15.4 GB) and Q4_K_M non-MTP (15.1 GB).

You can download the GGUF and run with the latest llama.cpp version this way:

llama-server -m Qwen3.6-27B-MTP-Q4_K_M-pure.gguf -fitt 128 -c 65536 -fa on -np 1 -ctk q5_0 -ctv q5_0 -ctxcp 18 --no-mmap --mlock --no-warmup --chat-template-kwargs '{"preserve_thinking": true}' --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -ub 256 -b 1024 -ngl 99 --spec-type draft-mtp --spec-draft-n-max 2

TOKEN SPEED

With the MTP version, I got 40 tok/s for tg, but slower pp, while the non-MTP version has higher pp and tg at 24 tok/s.

Version	Prompt Processing	Token Generation
MTP	195 tok/s	40 tok/s
Non MTP	715 tok/s	24 tok/s

MODEL SIZE

https://preview.redd.it/74ehd6vyvr2h1.png?width=5845&format=png&auto=webp&s=a66ba493ea1eb7fb61c999a47670c093700b9a97

MTP Version:

Model	Size
huytd/Qwen3.6-27B-pure-GGUF Q4_K_M MTP	15.4 GB
froggeric/Qwen3.6-27B-MTP-GGUF Q4_K_M MTP	16.8 GB
unsloth/Qwen3.6-27B-MTP-GGUF Q4_K_M MTP	17.1 GB

Non MTP Version:

Model	Size
huytd/Qwen3.6-27B-pure-GGUF Q4_K_M	15.1 GB
mradermacher/Qwen3.6-27B-GGUF Q4_K_M	16.5 GB
unsloth/Qwen3.6-27B-GGUF Q4_K_M	16.8 GB
bartowski/Qwen_Qwen3.6-27B-GGUF Q4_K_M	18 GB

PERPLEXITY DIFFERENCE

Currently I don't have the hardware that can run KLD benchmark, so just showing PPL difference here, but it should be good for you to get the trade-offs between quality and the size reduciton here.

https://preview.redd.it/lepgzq18wr2h1.png?width=4968&format=png&auto=webp&s=ece2b3f99f1406d0f46e3665e31b65a3b50fe7e7

Variant	PPL	Delta
BF16 MTP	7.5992 +/- 0.02890	base
This Q4_K_M MTP	7.7699 +/- 0.02972	+0.1707
Unsloth's Q4_K_M MTP	7.6545 +/- 0.02913	+0.0553
BF16 non-MTP	7.5992 +/- 0.02890	base
This Q4_K_M non-MTP	7.7043 +/- 0.02935	+0.1051
Unsloth's Q4_K_M non-MTP	7.6532 +/- 0.02912	+0.0540

submitted by /u/bobaburger
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA