Need a second pair of eyes, this Qwen3.6 27B quant recipe consistently thinks less and is correct
Ok, hear me out.
This all started when I was trying to understand why this Qwen3.6 27B INT8 AutoRound (https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/tree/main) recipe was performing so much better than any other Qwen3.6 27B quant I tried. On some personal Rust + Bevy benchmarks, it was consistently outputting better code and games. I then noticed the model did a LOT less thinking. The INT8 model is great, but vLLM VRAM usage is higher. And since llama.cpp has MTP (in a PR), I figured I'd try to quant this myself and add MTP too.
What's interesting is that both the INT8 AutoRound and my GGUF quant seem to perform better than UD Q8 K XL in terms of getting to the answer sooner. I chose to keep the same layers in BF16 as Minachist did. For my formal testing, I am using AIME math problems plus custom math problems that Opus 4.7 created for me. The new quant is about the same size, just slightly bigger than UD Q8 K XL, but the difference is surprisingly noticeable.
I think running these same tests in BF16 will reveal whether this behavior is truly preferable or not. It may also just be that thinking more is actually better, but my experience tells me the opposite. Nonetheless, here are some results.
My tests were against these quants (note these include MTP layers so they are slightly bigger):
- Q8_0 (28595762432 bytes)
  - Size on disk: 29047084160 bytes (28.3 GiB)
- UD Q8 K XL
  - Size on disk: 35776484480 bytes (34.9 GiB)
- Custom Q8 (the quant I tried to copy layer for layer from the INT8 AutoRound recipe)
  - Size on disk: 37144875200 bytes (36.2 GiB)
So is it really surprising that the bigger quant performed better? No. What's very interesting, though, is that the thinking is drastically reduced. So the KV cache space you lose by running a bigger quant is regained by spending 20% fewer tokens on thinking.
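As a back-of-envelope check of that trade-off (my arithmetic, using the disk sizes above and the Question 1 token counts below):

\[
\Delta_{\text{size}} = 37144875200 - 35776484480 \approx 1.27\ \text{GiB}, \qquad \Delta_{\text{tokens}} = 16001 - 9671 = 6330,
\]
\[
\frac{\Delta_{\text{size}}}{\Delta_{\text{tokens}}} \approx 216\ \text{KB per saved token at break-even.}
\]

If this architecture's per-token KV cost is below that, the bigger file still costs net VRAM; either way, the wall-clock and token savings stand on their own.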
Here are some runs I did:
Note that all runs used the same seed and sampling parameters. Multiple runs (3) produced identical outputs. KV cache at bf16/bf16.
```
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --seed 1337
```
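For context, here is a minimal sketch of how a run like this could be launched (not from the original post; these are mainline llama.cpp flags, and the MTP PR's flags are omitted since I don't know them):

```
# Sketch only: mainline llama-cli flags; adjust paths to taste.
/home/user/llm/llama.cpp/build/bin/llama-cli \
  -m /home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-Q8-BIGBOY.gguf \
  -ctk bf16 -ctv bf16 \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 0.0 --repeat-penalty 1.0 --seed 1337 \
  -f question1.txt   # question1.txt: hypothetical file holding the prompt
```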
Question 1 (Math, AIME style)

The roots of \(x^3-7x^2+14x-8=0\) are \(a,b,c\). If \(\frac1{a^2+1}+\frac1{b^2+1}+\frac1{c^2+1}=\frac mn\) in lowest terms, find \(m+n\).
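(My own worked check, for anyone eyeballing the models' answers.) The cubic factors cleanly:

\[
x^3-7x^2+14x-8=(x-1)(x-2)(x-4)\;\Rightarrow\;\{a,b,c\}=\{1,2,4\},
\]
\[
\frac{1}{1^2+1}+\frac{1}{2^2+1}+\frac{1}{4^2+1}=\frac12+\frac15+\frac1{17}=\frac{85+34+10}{170}=\frac{129}{170},
\]

so \(m+n=129+170=299\).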
Llama CPP
- Q8
  - 16,234 tokens for 3 min and 48 sec at 70.90 t/s (remember this is MTP with 2 tokens)
- UD Q8 K XL
  - 16,001 tokens for 4 min and 00 sec at 66.24 t/s
- Custom Q8
  - 9,671 tokens for 2 min and 39 sec at 60.60 t/s (~40% less thinking)
vLLM
- Minachist INT8 AutoRound
  - 10,200 tokens for 2 min and 38 sec at 34.2 t/s (I didn't use MTP here)
Question 2 (Math, AIME style)
How many ordered pairs of positive integers \((x,y)\) satisfy \(x^2-y^2=2026\)?
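(Again my own check, not from the original post.) Since \(x^2-y^2=(x-y)(x+y)\) and \(2026=2\cdot1013\) with 1013 prime, \(x-y\) and \(x+y\) must have the same parity; but odd times odd is odd, even times even is \(\equiv 0 \pmod 4\), and \(2026 \equiv 2 \pmod 4\), so no factorization works and the answer is \(0\) ordered pairs.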
Llama CPP
- Q8
  - 7,598 tokens for 1 min and 44 sec at 72.76 t/s
  - Strangely, plain Q8 even did better than UD Q8 K XL here
- Custom Q8
  - 5,666 tokens for 1 min and 33 sec at 60.49 t/s (~59% less thinking)
- UD Q8 K XL
  - 13,596 tokens for 3 min and 29 sec at 65.02 t/s
vLLM
- Minachist INT8 AutoRound
  - 8,931 tokens at 34.4 t/s (I didn't use MTP here)
There are a few more math tests I ran, but you get the gist: the quant is thinking a lot less.
For anyone who wants to reproduce:
I downloaded the HF safetensors and converted them to a single GGUF, then used llama.cpp to quant it down.
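The download step would look something like this (the BF16 repo id is my guess; only the Minachist INT8 repo is linked above):

```
# Assumed repo id; fetch the BF16 safetensors into the path used below.
huggingface-cli download Qwen/Qwen3.6-27B \
  --local-dir /home/user/llm/models/Qwen3.6-27B/BF16
```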
This is the minimum quant command needed to try it:
```
# Convert safetensors to GGUF
/home/user/llm/llama.cpp/convert_hf_to_gguf.py /home/user/llm/models/Qwen3.6-27B/BF16 \
  --outfile /home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-BF16.gguf

# Quant while keeping selected layers in BF16
/home/user/llm/llama.cpp/build/bin/llama-quantize \
  --tensor-type token_embd=bf16 \
  --tensor-type output=bf16 \
  --tensor-type output_norm=bf16 \
  --tensor-type post_attention_norm=bf16 \
  --tensor-type attn_q_norm=bf16 \
  --tensor-type attn_k_norm=bf16 \
  --tensor-type attn_qkv=bf16 \
  --tensor-type attn_gate=bf16 \
  --tensor-type ssm_a=bf16 \
  --tensor-type ssm_alpha=bf16 \
  --tensor-type ssm_beta=bf16 \
  --tensor-type ssm_conv1d=bf16 \
  --tensor-type ssm_dt.bias=bf16 \
  --tensor-type ssm_norm=bf16 \
  --tensor-type ssm_out=bf16 \
  /home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-BF16.gguf \
  /home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-Q8-BIGBOY.gguf \
  q8_0
```
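To sanity-check which tensors actually stayed BF16 after quantizing, something like this should work (gguf-dump comes from the gguf Python package bundled with llama.cpp's gguf-py; exact output format is an assumption):

```
# pip install gguf   (or use llama.cpp's bundled gguf-py)
# Dump tensor metadata and keep only the BF16 entries.
gguf-dump /home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-Q8-BIGBOY.gguf | grep -i bf16
```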
Adding the following layers to the previous quant does NOT improve anything for me (leaving them quantized saves about 1 GB, I think):

```
--tensor-type attn_norm=bf16 \
--tensor-type attn_output=bf16 \
--tensor-type attn_q=bf16 \
--tensor-type attn_k=bf16 \
--tensor-type attn_v=bf16 \
```

Ideas why it might be good:
- Instead of F16, we're using BF16
- It's literally bigger, so more layers left in native format
- The layers we left at BF16 are important
Some limitations:
- I ran the tests only 3 times per model per question
- I should probably re-run the tests with another seed
- I didn't run benchmark suites. That would be helpful, but we also need to be mindful that Qwen is benchmaxed as shown in Contamination Detection via Context (CoDeC) benchmarks.
Next steps:
- I'll re-run the tests with another seed
- Rent a RunPod instance to run BF16 with the same seed and sampling parameters