This is not a diss to Unsloth; they make great quants and really move this community forward.
I've been experimenting with keeping specific sublayers at higher precision, picked by which ones show the most outliers after a Q8 quant. I basically did a BF16 to Q8_0 conversion and compared the post-quant values against the originals. I found several layers that had a CRAZY high number of outliers. I'm not certain this approach is better, but the results are interesting!
I still need to upload the Q8 quant to Hugging Face, but here are some initial benchmarks.
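For anyone who wants to try a similar outlier hunt, here is a minimal pure-Python sketch of a Q8_0-style round trip (blocks of 32 values, one scale per block set to the block's max magnitude divided by 127). The outlier rule in `count_outliers` (error larger than a fraction of the tensor's median magnitude) and the `threshold` value are my own assumptions for illustration, not necessarily the criterion used for the quants in this post. The sketch also keeps the scale as a plain float instead of fp16.

```python
import statistics

BLOCK = 32  # Q8_0 groups weights into blocks of 32 that share one scale


def q8_0_roundtrip(values):
    """Quantize to 8-bit integers per block, then dequantize again.

    Simplification: the scale stays a Python float (the real format
    stores it as fp16), so errors here are slightly optimistic.
    """
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        scale = max(abs(v) for v in block) / 127.0
        if scale == 0.0:
            out.extend(0.0 for _ in block)
        else:
            out.extend(round(v / scale) * scale for v in block)
    return out


def count_outliers(values, threshold=0.25):
    """Hypothetical rule: a weight is an outlier when its round-trip error
    exceeds `threshold` times the tensor's median absolute value."""
    limit = threshold * statistics.median(abs(v) for v in values)
    restored = q8_0_roundtrip(values)
    return sum(1 for v, r in zip(values, restored) if abs(v - r) > limit)


# Toy tensors: one well-behaved block, and the same block followed by a
# block where a single huge weight forces a coarse scale on its neighbours.
smooth = [0.01 * ((i % 7) - 3) / 3 for i in range(BLOCK)]
spiky = list(smooth)
spiky[0] = 10.0

print("outliers, smooth block only:", count_outliers(smooth))
print("outliers, with a spiky block:", count_outliers(smooth + spiky))
```

Ranking tensors by a count like this is one way to build a candidate list of sublayers to keep in BF16.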
Some limitations:
- The dataset used here was wiki.test.raw at -c 2048 and --chunks 200
- I think it's possible that other datasets could show different outliers
- I didn't run any benchmarks to show performance on actual tests (e.g. coding)
- Q8-CC has a worse perplexity but a better same-top-p rate and mean KLD than UD-Q8_K_XL.
Quick summary:
35776484480 (33.31GiB) Qwen3.6-27B-UD-Q8_K_XL.gguf
32726111136 (30.47GiB) Qwen3.6-27B-Q8-CC.gguf
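To put the two file sizes side by side, here is a quick arithmetic sketch using the byte counts above. The variable names are just for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB

sizes = {
    "Qwen3.6-27B-UD-Q8_K_XL.gguf": 35776484480,
    "Qwen3.6-27B-Q8-CC.gguf": 32726111136,
}

ud_bytes = sizes["Qwen3.6-27B-UD-Q8_K_XL.gguf"]
cc_bytes = sizes["Qwen3.6-27B-Q8-CC.gguf"]

saved_bytes = ud_bytes - cc_bytes
saved_pct = 100.0 * saved_bytes / ud_bytes

for name, n_bytes in sizes.items():
    print(f"{name}: {n_bytes / GIB:.2f} GiB")
print(f"Q8-CC is smaller by {saved_bytes / GIB:.2f} GiB ({saved_pct:.1f}%)")
```

That works out to roughly 2.8 GiB, or about 8.5% of the UD-Q8_K_XL file size.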
https://preview.redd.it/w0jhv0pxua5h1.png?width=824&format=png&auto=webp&s=fe78bad7b13099a52dfabe89728976fa079c1289
| Metric | Qwen3.6-27B-UD-Q8_K_XL | Qwen3.6-27B-Q8-CC |
|---|---|---|
| Mean KLD | 0.012100 ± 0.000836 | 0.011324 ± 0.000790 |
| Maximum KLD | 24.382509 | 24.220026 |
| 99.9% KLD | 2.473664 | 2.506243 |
| 99.0% KLD | 0.024188 | 0.023331 |
| 95.0% KLD | 0.005269 | 0.003847 |
| 90.0% KLD | 0.003549 | 0.002324 |
| Median KLD | 0.000954 | 0.000499 |
| 10.0% KLD | 0.000009 | 0.000004 |
| 5.0% KLD | 0.000002 | 0.000001 |
| 1.0% KLD | -0.000001 | -0.000001 |
| 0.1% KLD | -0.000007 | -0.000010 |
| Minimum KLD | -0.000054 | -0.000112 |
https://preview.redd.it/yofs0o91va5h1.png?width=718&format=png&auto=webp&s=4989043a306ee5681ee316ccffa13a27be1d7b3d
| Metric | Qwen3.6-27B-UD-Q8_K_XL | Qwen3.6-27B-Q8-CC |
|---|---|---|
| Mean Δp | -0.005% ± 0.006% | -0.027% ± 0.006% |
| Maximum Δp | 99.59% | 99.80% |
| 99.9% Δp | 15.23% | 13.59% |
| 99.0% Δp | 4.09% | 3.08% |
| 95.0% Δp | 2.07% | 1.56% |
| 90.0% Δp | 1.19% | 0.69% |
| 75.0% Δp | 0.21% | 0.08% |
| Median Δp | 0.00% | 0.00% |
| 25.0% Δp | -0.24% | -0.08% |
| 10.0% Δp | -1.23% | -0.77% |
| 5.0% Δp | -2.10% | -1.68% |
| 1.0% Δp | -4.16% | -3.21% |
| 0.1% Δp | -12.02% | -16.60% |
| Minimum Δp | -99.92% | -99.92% |
| RMS Δp | 2.340% ± 0.080% | 2.305% ± 0.084% |
| Same top p | 97.426% ± 0.041% | 98.358% ± 0.033% |
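For readers less familiar with these metrics, here is a rough sketch of how the per-token numbers behind the tables can be computed from two sets of logits. This reflects my understanding of the definitions (KLD of the quant's distribution from the base model's, Δp as the change in probability of the reference token, "same top p" as agreement on the most likely token), not llama.cpp's actual code. The logits and the `target` index are made-up toy values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def compare_position(base_logits, quant_logits, target):
    """Compare base and quantized model outputs at one token position."""
    log_p = log_softmax(base_logits)   # base (BF16) model
    log_q = log_softmax(quant_logits)  # quantized model
    kld = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    delta_p = math.exp(log_q[target]) - math.exp(log_p[target])
    top_base = max(range(len(log_p)), key=log_p.__getitem__)
    top_quant = max(range(len(log_q)), key=log_q.__getitem__)
    return {"kld": kld, "delta_p": delta_p, "same_top": top_base == top_quant}


# Toy 3-token vocabulary; index 0 is assumed to be the reference token.
result = compare_position([2.0, 1.0, 0.1], [1.9, 1.2, 0.1], target=0)
print(result)
```

The reported statistics are then means and percentiles of these per-position values over the whole test set. Mathematically KLD cannot be negative, so the tiny negative percentiles in the tables are presumably numerical noise.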
The recipe for the Qwen3.6-27B-Q8-CC.gguf quant:
    /home/user/llm/llama.cpp/build/bin/llama-quantize \
      --token-embedding-type bf16 \
      --tensor-type output_norm=bf16 \
      --tensor-type attn_k=bf16 \
      --tensor-type attn_v=bf16 \
      --tensor-type post_attention_norm=bf16 \
      --tensor-type attn_q_norm=bf16 \
      --tensor-type attn_k_norm=bf16 \
      --tensor-type attn_norm=bf16 \
      --tensor-type ssm_a=bf16 \
      --tensor-type ssm_alpha=bf16 \
      --tensor-type ssm_beta=bf16 \
      --tensor-type ssm_conv1d=bf16 \
      --tensor-type ssm_dt.bias=bf16 \
      --tensor-type ssm_norm=bf16 \
      --tensor-type nextn.eh_proj=bf16 \
      --tensor-type blk.34.attn_gate=bf16 \
      --tensor-type blk.19.attn_output=bf16 \
      --tensor-type blk.11.attn_q=bf16 \
      --tensor-type blk.63.attn_q=bf16 \
      --tensor-type blk.27.attn_q=bf16 \
      --tensor-type blk.0.attn_qkv=bf16 \
      --tensor-type blk.37.attn_qkv=bf16 \
      --tensor-type blk.28.attn_qkv=bf16 \
      --tensor-type blk.6.ffn_down=bf16 \
      --tensor-type blk.64.ffn_down=bf16 \
      --tensor-type blk.0.ffn_down=bf16 \
      --tensor-type blk.63.ffn_gate=bf16 \
      --tensor-type blk.62.ffn_gate=bf16 \
      --tensor-type blk.63.ffn_up=bf16 \
      --tensor-type blk.62.ffn_up=bf16 \
      --tensor-type blk.37.ssm_out=bf16 \
      --tensor-type blk.0.ssm_out=bf16 \
      --tensor-type blk.34.ssm_out=bf16 \
      --output-tensor-type bf16 \
      /home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-BF16-00001-of-00002.gguf \
      /home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-Q8-CC.gguf \
      q8_0
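Typing dozens of `--tensor-type` flags by hand gets error-prone, so here is a small sketch that assembles the same kind of command from a list of overrides. The helper `build_quantize_cmd` is hypothetical (my own name), and it only uses flags that appear in the recipe above. The override list is a short excerpt of the recipe, not the full thing.

```python
import shlex


def build_quantize_cmd(binary, src, dst, base_type, overrides,
                       token_embedding_type="bf16", output_tensor_type="bf16"):
    """Return the llama-quantize argument list for a per-tensor recipe."""
    cmd = [binary, "--token-embedding-type", token_embedding_type]
    for tensor_name, tensor_type in overrides:
        cmd += ["--tensor-type", f"{tensor_name}={tensor_type}"]
    cmd += ["--output-tensor-type", output_tensor_type, src, dst, base_type]
    return cmd


# Excerpt of the Q8-CC recipe: tensor-name pattern -> type to keep it in.
overrides = [
    ("output_norm", "bf16"),
    ("attn_k", "bf16"),
    ("attn_v", "bf16"),
    ("blk.34.attn_gate", "bf16"),
    ("blk.19.attn_output", "bf16"),
]

cmd = build_quantize_cmd(
    "/home/user/llm/llama.cpp/build/bin/llama-quantize",
    "/home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-BF16-00001-of-00002.gguf",
    "/home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-Q8-CC.gguf",
    "q8_0",
    overrides,
)

# Print a copy-pasteable shell command, one argument per line.
print(" \\\n  ".join(shlex.quote(part) for part in cmd))
```

Keeping the overrides in a list also makes it easy to diff two recipes, or to pass `cmd` straight to `subprocess.run`.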
RAW DATA:
The baseline here is Qwen 3.6 27B BF16 with KV cache BF16
NORMAL Q8, nothing custom:
    ====== Perplexity statistics ======
    Mean PPL(Q) : 6.655412 ± 0.045246
    Mean PPL(base) : 6.636486 ± 0.044736
    Cor(ln(PPL(Q)), ln(PPL(base))): 99.52%
    Mean ln(PPL(Q)/PPL(base)) : 0.002848 ± 0.000667
    Mean PPL(Q)/PPL(base) : 1.002852 ± 0.000668
    Mean PPL(Q)-PPL(base) : 0.018927 ± 0.004442
    ====== KL divergence statistics ======
    Mean KLD: 0.012557 ± 0.000850
    Maximum KLD: 24.464790
    99.9% KLD: 2.964850
    99.0% KLD: 0.028737
    95.0% KLD: 0.003968
    90.0% KLD: 0.002280
    Median KLD: 0.000562
    10.0% KLD: 0.000007
    5.0% KLD: 0.000001
    1.0% KLD: -0.000001
    0.1% KLD: -0.000006
    Minimum KLD: -0.000057
    ====== Token probability statistics ======
    Mean Δp: -0.017 ± 0.006 %
    Maximum Δp: 99.818%
    99.9% Δp: 15.451%
    99.0% Δp: 3.027%
    95.0% Δp: 1.402%
    90.0% Δp: 0.821%
    75.0% Δp: 0.152%
    Median Δp: -0.000%
    25.0% Δp: -0.179%
    10.0% Δp: -0.885%
    5.0% Δp: -1.477%
    1.0% Δp: -3.127%
    0.1% Δp: -13.658%
    Minimum Δp: -99.648%
    RMS Δp : 2.350 ± 0.085 %
    Same top p: 97.771 ± 0.038 %
Qwen3.6-27B-UD-Q8_K_XL.gguf
35776484480 (33.31GiB) Qwen3.6-27B-UD-Q8_K_XL.gguf
    ====== Perplexity statistics ======
    Mean PPL(Q) : 6.663686 ± 0.045346
    Mean PPL(base) : 6.636486 ± 0.044736
    Cor(ln(PPL(Q)), ln(PPL(base))): 99.54%
    Mean ln(PPL(Q)/PPL(base)) : 0.004090 ± 0.000656
    Mean PPL(Q)/PPL(base) : 1.004099 ± 0.000659
    Mean PPL(Q)-PPL(base) : 0.027200 ± 0.004384
    ====== KL divergence statistics ======
    Mean KLD: 0.012100 ± 0.000836
    Maximum KLD: 24.382509
    99.9% KLD: 2.473664
    99.0% KLD: 0.024188
    95.0% KLD: 0.005269
    90.0% KLD: 0.003549
    Median KLD: 0.000954
    10.0% KLD: 0.000009
    5.0% KLD: 0.000002
    1.0% KLD: -0.000001
    0.1% KLD: -0.000007
    Minimum KLD: -0.000054
    ====== Token probability statistics ======
    Mean Δp: -0.005 ± 0.006 %
    Maximum Δp: 99.594%
    99.9% Δp: 15.232%
    99.0% Δp: 4.091%
    95.0% Δp: 2.066%
    90.0% Δp: 1.186%
    75.0% Δp: 0.214%
    Median Δp: -0.000%
    25.0% Δp: -0.236%
    10.0% Δp: -1.229%
    5.0% Δp: -2.097%
    1.0% Δp: -4.163%
    0.1% Δp: -12.016%
    Minimum Δp: -99.923%
    RMS Δp : 2.340 ± 0.080 %
    Same top p: 97.426 ± 0.041 %
Qwen3.6-27B-Q8-CC.gguf
32726111136 (30.47GiB) Qwen3.6-27B-Q8-CC.gguf
Note that PPL seems worse here but token probability and KL divergence seem better.
    ====== Perplexity statistics ======
    Mean PPL(Q) : 6.681999 ± 0.045554
    Mean PPL(base) : 6.636486 ± 0.044736
    Cor(ln(PPL(Q)), ln(PPL(base))): 99.49%
    Mean ln(PPL(Q)/PPL(base)) : 0.006835 ± 0.000688
    Mean PPL(Q)/PPL(base) : 1.006858 ± 0.000693
    Mean PPL(Q)-PPL(base) : 0.045513 ± 0.004626
    ====== KL divergence statistics ======
    Mean KLD: 0.011324 ± 0.000790
    Maximum KLD: 24.220026
    99.9% KLD: 2.506243
    99.0% KLD: 0.023331
    95.0% KLD: 0.003847
    90.0% KLD: 0.002324
    Median KLD: 0.000499
    10.0% KLD: 0.000004
    5.0% KLD: 0.000001
    1.0% KLD: -0.000001
    0.1% KLD: -0.000010
    Minimum KLD: -0.000112
    ====== Token probability statistics ======
    Mean Δp: -0.027 ± 0.006 %
    Maximum Δp: 99.801%
    99.9% Δp: 13.591%
    99.0% Δp: 3.079%
    95.0% Δp: 1.560%
    90.0% Δp: 0.686%
    75.0% Δp: 0.077%
    Median Δp: 0.000%
    25.0% Δp: -0.084%
    10.0% Δp: -0.770%
    5.0% Δp: -1.682%
    1.0% Δp: -3.208%
    0.1% Δp: -16.596%
    Minimum Δp: -99.918%
    RMS Δp : 2.305 ± 0.084 %
    Same top p: 98.358 ± 0.033 %
For extra points, here's another quant that's still smaller than UD Q8 K XL and performs better on multiple metrics.
Qwen3.6-27B-Q8-CC-5.gguf
35144389536 (32.73GiB) Qwen3.6-27B-Q8-CC-5.gguf
    ====== Perplexity statistics ======
    Mean PPL(Q) : 6.670677 ± 0.045414
    Mean PPL(base) : 6.636486 ± 0.044736
    Cor(ln(PPL(Q)), ln(PPL(base))): 99.59%
    Mean ln(PPL(Q)/PPL(base)) : 0.005139 ± 0.000618
    Mean PPL(Q)/PPL(base) : 1.005152 ± 0.000621
    Mean PPL(Q)-PPL(base) : 0.034192 ± 0.004145
    ====== KL divergence statistics ======
    Mean KLD: 0.010970 ± 0.000828
    Maximum KLD: 25.486208
    99.9% KLD: 1.975405
    99.0% KLD: 0.021026
    95.0% KLD: 0.003457
    90.0% KLD: 0.002151
    Median KLD: 0.000438
    10.0% KLD: 0.000003
    5.0% KLD: 0.000001
    1.0% KLD: -0.000002
    0.1% KLD: -0.000011
    Minimum KLD: -0.000480
    ====== Token probability statistics ======
    Mean Δp: -0.020 ± 0.006 %
    Maximum Δp: 99.828%
    99.9% Δp: 13.630%
    99.0% Δp: 3.038%
    95.0% Δp: 1.474%
    90.0% Δp: 0.643%
    75.0% Δp: 0.072%
    Median Δp: 0.000%
    25.0% Δp: -0.073%
    10.0% Δp: -0.714%
    5.0% Δp: -1.669%
    1.0% Δp: -3.113%
    0.1% Δp: -12.475%
    Minimum Δp: -99.916%
    RMS Δp : 2.201 ± 0.084 %
    Same top p: 98.453 ± 0.032 %
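If you want to compare runs like these without eyeballing the dumps, here is a small sketch that pulls a few headline numbers out of the statistics text with regular expressions. The field names and the `parse_stats` helper are my own, and the two sample strings are short excerpts of the UD-Q8_K_XL and CC-5 outputs above.

```python
import re

# Regexes for a handful of headline metrics in the statistics dump.
FIELDS = {
    "mean_ppl_q": r"Mean PPL\(Q\)\s*:\s*([-\d.]+)",
    "mean_kld": r"Mean KLD:\s*([-\d.]+)",
    "kld_99_9": r"99\.9% KLD:\s*([-\d.]+)",
    "rms_dp": r"RMS Δp\s*:\s*([-\d.]+)",
    "same_top_p": r"Same top p:\s*([-\d.]+)",
}


def parse_stats(text):
    """Extract the metrics in FIELDS; works on one-line or multi-line dumps."""
    stats = {}
    for key, pattern in FIELDS.items():
        match = re.search(pattern, text)
        if match:
            stats[key] = float(match.group(1))
    return stats


UD_XL = ("Mean PPL(Q) : 6.663686 ± 0.045346 Mean KLD: 0.012100 ± 0.000836 "
         "99.9% KLD: 2.473664 RMS Δp : 2.340 ± 0.080 % "
         "Same top p: 97.426 ± 0.041 %")
CC_5 = ("Mean PPL(Q) : 6.670677 ± 0.045414 Mean KLD: 0.010970 ± 0.000828 "
        "99.9% KLD: 1.975405 RMS Δp : 2.201 ± 0.084 % "
        "Same top p: 98.453 ± 0.032 %")

ud, cc5 = parse_stats(UD_XL), parse_stats(CC_5)
for key in FIELDS:
    print(f"{key:>12}: UD-Q8_K_XL={ud[key]:<10} CC-5={cc5[key]}")
```

Lower is better for everything here except `same_top_p`, where higher is better.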
And here's the recipe for CC-5:
    /home/user/llm/llama.cpp/build/bin/llama-quantize \
      --token-embedding-type bf16 \
      --tensor-type output_norm=bf16 \
      --tensor-type attn_k=bf16 \
      --tensor-type post_attention_norm=bf16 \
      --tensor-type attn_q_norm=bf16 \
      --tensor-type attn_k_norm=bf16 \
      --tensor-type attn_norm=bf16 \
      --tensor-type ssm_a=bf16 \
      --tensor-type ssm_alpha=bf16 \
      --tensor-type ssm_beta=bf16 \
      --tensor-type ssm_conv1d=bf16 \
      --tensor-type ssm_dt.bias=bf16 \
      --tensor-type ssm_norm=bf16 \
      --tensor-type nextn.eh_proj=bf16 \
      --tensor-type blk.34.attn_gate=bf16 \
      --tensor-type blk.6.attn_gate=bf16 \
      --tensor-type blk.18.attn_gate=bf16 \
      --tensor-type blk.37.attn_gate=bf16 \
      --tensor-type blk.4.attn_gate=bf16 \
      --tensor-type blk.5.attn_gate=bf16 \
      --tensor-type blk.1.attn_gate=bf16 \
      --tensor-type blk.0.attn_gate=bf16 \
      --tensor-type blk.40.attn_gate=bf16 \
      --tensor-type blk.2.attn_gate=bf16 \
      --tensor-type blk.10.attn_gate=bf16 \
      --tensor-type blk.8.attn_gate=bf16 \
      --tensor-type blk.9.attn_gate=bf16 \
      --tensor-type blk.16.attn_gate=bf16 \
      --tensor-type blk.11.attn_q=bf16 \
      --tensor-type blk.63.attn_q=bf16 \
      --tensor-type blk.27.attn_q=bf16 \
      --tensor-type blk.43.attn_q=bf16 \
      --tensor-type blk.59.attn_q=bf16 \
      --tensor-type blk.47.attn_q=bf16 \
      --tensor-type blk.51.attn_q=bf16 \
      --tensor-type blk.3.attn_q=bf16 \
      --tensor-type blk.7.attn_q=bf16 \
      --tensor-type blk.35.attn_q=bf16 \
      --tensor-type blk.0.attn_qkv=bf16 \
      --tensor-type blk.37.attn_qkv=bf16 \
      --tensor-type blk.28.attn_qkv=bf16 \
      --tensor-type blk.40.attn_qkv=bf16 \
      --tensor-type blk.32.attn_qkv=bf16 \
      --tensor-type blk.36.attn_qkv=bf16 \
      --tensor-type blk.33.attn_qkv=bf16 \
      --tensor-type blk.34.attn_qkv=bf16 \
      --tensor-type blk.30.attn_qkv=bf16 \
      --tensor-type blk.63.attn_v=bf16 \
      --tensor-type blk.59.attn_v=bf16 \
      --tensor-type blk.51.attn_v=bf16 \
      --tensor-type blk.55.attn_v=bf16 \
      --tensor-type blk.35.attn_v=bf16 \
      --tensor-type blk.43.attn_v=bf16 \
      --tensor-type blk.19.attn_v=bf16 \
      --tensor-type blk.47.attn_v=bf16 \
      --tensor-type blk.27.attn_v=bf16 \
      --tensor-type blk.39.attn_v=bf16 \
      --tensor-type blk.37.ssm_out=bf16 \
      --tensor-type blk.0.ssm_out=bf16 \
      --tensor-type blk.34.ssm_out=bf16 \
      --tensor-type blk.2.ssm_out=bf16 \
      --tensor-type blk.18.ssm_out=bf16 \
      --tensor-type blk.6.ssm_out=bf16 \
      --tensor-type blk.21.ssm_out=bf16 \
      --tensor-type blk.1.ssm_out=bf16 \
      --tensor-type blk.30.ssm_out=bf16 \
      --tensor-type blk.26.ssm_out=bf16 \
      --tensor-type blk.4.ssm_out=bf16 \
      --tensor-type blk.10.ssm_out=bf16 \
      --tensor-type blk.5.ssm_out=bf16 \
      --tensor-type blk.14.ssm_out=bf16 \
      --tensor-type blk.25.ssm_out=bf16 \
      --tensor-type blk.12.ssm_out=bf16 \
      --tensor-type blk.8.ssm_out=bf16 \
      --tensor-type blk.28.ssm_out=bf16 \
      --tensor-type blk.9.ssm_out=bf16 \
      --tensor-type blk.63.ffn_up=bf16 \
      --tensor-type blk.62.ffn_up=bf16 \
      --tensor-type blk.61.ffn_up=bf16 \
      --tensor-type blk.22.ffn_up=bf16 \
      --tensor-type blk.63.ffn_gate=bf16 \
      --tensor-type blk.50.ffn_gate=bf16 \
      --tensor-type blk.49.ffn_gate=bf16 \
      --tensor-type blk.34.ffn_gate=bf16 \
      --tensor-type blk.61.ffn_gate=bf16 \
      --tensor-type blk.62.ffn_gate=bf16 \
      --tensor-type blk.6.ffn_down=bf16 \
      --tensor-type blk.64.ffn_down=bf16 \
      --tensor-type blk.22.ffn_down=bf16 \
      --tensor-type blk.18.ffn_down=bf16 \
      --tensor-type blk.63.ffn_down=bf16 \
      --tensor-type blk.0.ffn_down=bf16 \
      --tensor-type blk.1.ffn_down=bf16 \
      --tensor-type blk.62.ffn_down=bf16 \
      --output-tensor-type bf16 \
      /home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-BF16-00001-of-00002.gguf \
      /home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-Q8-CC-5.gguf \
      q8_0
Q8 K XL vs CC-5:
https://preview.redd.it/fkkmks72wa5h1.png?width=585&format=png&auto=webp&s=b37a2c2c75687e61c13753700f4b42dbf6d3282c
| Metric | Qwen3.6-27B-UD-Q8_K_XL | Qwen3.6-27B-Q8-CC-5 |
|---|---|---|
| Mean KLD | 0.012100 ± 0.000836 | 0.010970 ± 0.000828 |
| Maximum KLD | 24.382509 | 25.486208 |
| 99.9% KLD | 2.473664 | 1.975405 |
| 99.0% KLD | 0.024188 | 0.021026 |
| 95.0% KLD | 0.005269 | 0.003457 |
| 90.0% KLD | 0.003549 | 0.002151 |
| Median KLD | 0.000954 | 0.000438 |
| 10.0% KLD | 0.000009 | 0.000003 |
| 5.0% KLD | 0.000002 | 0.000001 |
| 1.0% KLD | -0.000001 | -0.000002 |
| 0.1% KLD | -0.000007 | -0.000011 |
| Minimum KLD | -0.000054 | -0.000480 |

| Metric | Qwen3.6-27B-UD-Q8_K_XL | Qwen3.6-27B-Q8-CC-5 |
|---|---|---|
| Mean Δp | -0.005% ± 0.006% | -0.020% ± 0.006% |
| Maximum Δp | 99.59% | 99.83% |
| 99.9% Δp | 15.23% | 13.63% |
| 99.0% Δp | 4.09% | 3.04% |
| 95.0% Δp | 2.07% | 1.47% |
| 90.0% Δp | 1.19% | 0.64% |
| 75.0% Δp | 0.21% | 0.07% |
| Median Δp | 0.00% | 0.00% |
| 25.0% Δp | -0.24% | -0.07% |
| 10.0% Δp | -1.23% | -0.71% |
| 5.0% Δp | -2.10% | -1.67% |
| 1.0% Δp | -4.16% | -3.11% |
| 0.1% Δp | -12.02% | -12.48% |
| Minimum Δp | -99.92% | -99.92% |
| RMS Δp | 2.340% ± 0.080% | 2.201% ± 0.084% |
| Same top p | 97.426% ± 0.041% | 98.453% ± 0.032% |