r/LocalLLaMA · · 7 min read

Qwen 3.6 27B 30GB Same top p: 98.358 ± 0.033 % vs UD Q8 K XL 33GB Same top p: 97.426 ± 0.041 %

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

This is not a diss to Unsloth. They make great quants and really move this community forward.

I've been experimenting with choosing the quantization type for specific sublayers based on which ones have the most outliers after Q8 quantization. I did a BF16 to Q8_0 conversion and compared the post-quantization values against the originals. I found several layers that had a CRAZY high number of outliers. I'm not certain this is better, but the results are interesting!
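The post does not say how an outlier was defined, so the following is only a rough sketch of the idea under my own assumptions. It simulates a Q8_0-style round trip (blocks of 32 weights sharing one scale, 8-bit signed integers) and counts weights whose relative round-trip error exceeds a threshold. The names `q8_0_roundtrip`, `count_outliers` and the `rel_tol` threshold are hypothetical, and the sketch ignores the fact that the real format stores the scale in 16-bit float.

```python
import math

BLOCK = 32  # Q8_0 groups weights into blocks of 32 that share one scale


def _round_half_away(x):
    # Round to nearest integer, ties away from zero (unlike Python's round()).
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def q8_0_roundtrip(values):
    """Quantize to 8-bit per block and dequantize again (simplified)."""
    out = []
    for i in range(0, len(values), BLOCK):
        block = values[i:i + BLOCK]
        amax = max(abs(v) for v in block)
        if amax == 0.0:
            out.extend(0.0 for _ in block)
            continue
        d = amax / 127.0  # per-block scale
        out.extend(_round_half_away(v / d) * d for v in block)
    return out


def count_outliers(values, rel_tol=0.05):
    """Count weights whose relative round-trip error exceeds rel_tol.

    This definition of "outlier" is an assumption, not the author's.
    """
    restored = q8_0_roundtrip(values)
    count = 0
    for original, approx in zip(values, restored):
        if original != 0.0 and abs(original - approx) / abs(original) > rel_tol:
            count += 1
    return count


# A block with similar magnitudes quantizes cleanly ...
smooth = [100.0 + 0.5 * i for i in range(32)]
# ... while one large weight inflates the scale and hurts the other 31.
spiky = [1.4] * 31 + [127.0]
print(count_outliers(smooth), count_outliers(spiky))
```

Running a count like this per tensor and sorting by the result is one way a ranking of tensors to keep in BF16 could be produced.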

I still need to upload the Q8 quant to Hugging Face, but here are some initial benchmarks.

Some limitations:

  • The dataset used here was wiki.test.raw at -c 2048 and --chunks 200
  • I think it's possible that other datasets could show different outliers
  • I didn't run any benchmarks to show performance on actual tests (e.g. coding)
  • The Q8-CC has a worse perplexity but better top p and KLD than UD Q8 K XL.

Quick summary:

35776484480 (33.31GiB) Qwen3.6-27B-UD-Q8_K_XL.gguf

32726111136 (30.47GiB) Qwen3.6-27B-Q8-CC.gguf
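As a quick check of the size difference, the byte counts above can be converted and compared directly. This is plain arithmetic on the two numbers from the summary; the variable names are mine.

```python
GIB = 1024 ** 3  # bytes per GiB

ud_q8_k_xl = 35776484480  # Qwen3.6-27B-UD-Q8_K_XL.gguf
q8_cc = 32726111136       # Qwen3.6-27B-Q8-CC.gguf


def to_gib(n_bytes):
    return n_bytes / GIB


saved_bytes = ud_q8_k_xl - q8_cc
saved_fraction = saved_bytes / ud_q8_k_xl

print(f"UD Q8 K XL: {to_gib(ud_q8_k_xl):.2f} GiB")
print(f"Q8-CC:      {to_gib(q8_cc):.2f} GiB")
print(f"Saved:      {to_gib(saved_bytes):.2f} GiB ({saved_fraction:.1%})")
```

So Q8-CC is roughly 2.8 GiB, or about 8.5%, smaller than UD Q8 K XL.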

https://preview.redd.it/w0jhv0pxua5h1.png?width=824&format=png&auto=webp&s=fe78bad7b13099a52dfabe89728976fa079c1289

| Metric | Qwen3.6-27B-UD-Q8_K_XL | Qwen3.6-27B-Q8-CC |
| --- | --- | --- |
| Mean KLD | 0.012100 ± 0.000836 | 0.011324 ± 0.000790 |
| Maximum KLD | 24.382509 | 24.220026 |
| 99.9% KLD | 2.473664 | 2.506243 |
| 99.0% KLD | 0.024188 | 0.023331 |
| 95.0% KLD | 0.005269 | 0.003847 |
| 90.0% KLD | 0.003549 | 0.002324 |
| Median KLD | 0.000954 | 0.000499 |
| 10.0% KLD | 0.000009 | 0.000004 |
| 5.0% KLD | 0.000002 | 0.000001 |
| 1.0% KLD | -0.000001 | -0.000001 |
| 0.1% KLD | -0.000007 | -0.000010 |
| Minimum KLD | -0.000054 | -0.000112 |

https://preview.redd.it/yofs0o91va5h1.png?width=718&format=png&auto=webp&s=4989043a306ee5681ee316ccffa13a27be1d7b3d

| Metric | Qwen3.6-27B-UD-Q8_K_XL | Qwen3.6-27B-Q8-CC |
| --- | --- | --- |
| Mean Δp | -0.005% ± 0.006% | -0.027% ± 0.006% |
| Maximum Δp | 99.59% | 99.80% |
| 99.9% Δp | 15.23% | 13.59% |
| 99.0% Δp | 4.09% | 3.08% |
| 95.0% Δp | 2.07% | 1.56% |
| 90.0% Δp | 1.19% | 0.69% |
| 75.0% Δp | 0.21% | 0.08% |
| Median Δp | 0.00% | 0.00% |
| 25.0% Δp | -0.24% | -0.08% |
| 10.0% Δp | -1.23% | -0.77% |
| 5.0% Δp | -2.10% | -1.68% |
| 1.0% Δp | -4.16% | -3.21% |
| 0.1% Δp | -12.02% | -16.60% |
| Minimum Δp | -99.92% | -99.92% |
| RMS Δp | 2.340% ± 0.080% | 2.305% ± 0.084% |
| Same top p | 97.426% ± 0.041% | 98.358% ± 0.033% |
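For readers unfamiliar with these metrics, here is a minimal sketch of how I understand them: KLD is the KL divergence from the baseline's token distribution to the quant's, Δp is the change in probability assigned to the correct next token, and Same top p is how often both models pick the same most likely token. The function names are hypothetical and this is not the llama.cpp implementation, just the underlying math on toy logits.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def kld(base_logits, quant_logits):
    """KL(base || quant) for one token position."""
    lp = log_softmax(base_logits)
    lq = log_softmax(quant_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def delta_p(base_logits, quant_logits, correct):
    """Change in probability of the correct token (quant minus base)."""
    p_base = math.exp(log_softmax(base_logits)[correct])
    p_quant = math.exp(log_softmax(quant_logits)[correct])
    return p_quant - p_base


def same_top(base_logits, quant_logits):
    """True if both models rank the same token first."""
    top_base = max(range(len(base_logits)), key=base_logits.__getitem__)
    top_quant = max(range(len(quant_logits)), key=quant_logits.__getitem__)
    return top_base == top_quant


def summarize(positions):
    """positions: list of (base_logits, quant_logits, correct_token_index)."""
    n = len(positions)
    return {
        "mean_kld": sum(kld(b, q) for b, q, _ in positions) / n,
        "mean_delta_p": sum(delta_p(b, q, c) for b, q, c in positions) / n,
        "same_top": sum(same_top(b, q) for b, q, _ in positions) / n,
    }


# Toy data: one position where the quant agrees, one where the top token flips.
toy = [
    ([2.0, 1.0, 0.0], [2.0, 1.0, 0.0], 0),
    ([2.0, 1.0, 0.0], [1.0, 2.0, 0.0], 0),
]
print(summarize(toy))
```

Mathematically KLD cannot be negative, so the tiny negative values in the low percentiles are presumably floating-point rounding rather than a real effect.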

The recipe for the Qwen3.6-27B-Q8-CC.gguf quant:

/home/user/llm/llama.cpp/build/bin/llama-quantize \
  --token-embedding-type bf16 \
  --tensor-type output_norm=bf16 \
  --tensor-type attn_k=bf16 \
  --tensor-type attn_v=bf16 \
  --tensor-type post_attention_norm=bf16 \
  --tensor-type attn_q_norm=bf16 \
  --tensor-type attn_k_norm=bf16 \
  --tensor-type attn_norm=bf16 \
  --tensor-type ssm_a=bf16 \
  --tensor-type ssm_alpha=bf16 \
  --tensor-type ssm_beta=bf16 \
  --tensor-type ssm_conv1d=bf16 \
  --tensor-type ssm_dt.bias=bf16 \
  --tensor-type ssm_norm=bf16 \
  --tensor-type nextn.eh_proj=bf16 \
  --tensor-type blk.34.attn_gate=bf16 \
  --tensor-type blk.19.attn_output=bf16 \
  --tensor-type blk.11.attn_q=bf16 \
  --tensor-type blk.63.attn_q=bf16 \
  --tensor-type blk.27.attn_q=bf16 \
  --tensor-type blk.0.attn_qkv=bf16 \
  --tensor-type blk.37.attn_qkv=bf16 \
  --tensor-type blk.28.attn_qkv=bf16 \
  --tensor-type blk.6.ffn_down=bf16 \
  --tensor-type blk.64.ffn_down=bf16 \
  --tensor-type blk.0.ffn_down=bf16 \
  --tensor-type blk.63.ffn_gate=bf16 \
  --tensor-type blk.62.ffn_gate=bf16 \
  --tensor-type blk.63.ffn_up=bf16 \
  --tensor-type blk.62.ffn_up=bf16 \
  --tensor-type blk.37.ssm_out=bf16 \
  --tensor-type blk.0.ssm_out=bf16 \
  --tensor-type blk.34.ssm_out=bf16 \
  --output-tensor-type bf16 \
  /home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-BF16-00001-of-00002.gguf \
  /home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-Q8-CC.gguf \
  q8_0
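With this many overrides, it may be easier to generate the command than to edit it by hand. The sketch below builds the argument list from plain Python data, using only the flags that appear in the recipe above. The function name `build_quantize_cmd` and the way the overrides are grouped are my own choices, and only a subset of the tensors is shown for brevity.

```python
import shlex


def build_quantize_cmd(binary, src, dst, base_type,
                       global_overrides, block_overrides, keep_type="bf16"):
    """Return the llama-quantize argument list for a per-tensor recipe."""
    args = [binary, "--token-embedding-type", keep_type]
    # Overrides that apply to every block with this tensor name.
    for name in global_overrides:
        args += ["--tensor-type", f"{name}={keep_type}"]
    # Overrides for individual blocks, e.g. blk.34.attn_gate.
    for tensor, blocks in block_overrides.items():
        for block in blocks:
            args += ["--tensor-type", f"blk.{block}.{tensor}={keep_type}"]
    args += ["--output-tensor-type", keep_type, src, dst, base_type]
    return args


cmd = build_quantize_cmd(
    binary="/home/user/llm/llama.cpp/build/bin/llama-quantize",
    src="/home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-BF16-00001-of-00002.gguf",
    dst="/home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-Q8-CC.gguf",
    base_type="q8_0",
    global_overrides=["output_norm", "attn_k", "attn_v"],  # subset of the recipe
    block_overrides={"attn_gate": [34], "attn_q": [11, 63, 27]},  # subset
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the full set of tensors is filled in.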

RAW DATA:

The baseline here is Qwen 3.6 27B BF16 with a BF16 KV cache.

NORMAL Q8, nothing custom:

====== Perplexity statistics ======
Mean PPL(Q) : 6.655412 ± 0.045246
Mean PPL(base) : 6.636486 ± 0.044736
Cor(ln(PPL(Q)), ln(PPL(base))): 99.52%
Mean ln(PPL(Q)/PPL(base)) : 0.002848 ± 0.000667
Mean PPL(Q)/PPL(base) : 1.002852 ± 0.000668
Mean PPL(Q)-PPL(base) : 0.018927 ± 0.004442

====== KL divergence statistics ======
Mean KLD: 0.012557 ± 0.000850
Maximum KLD: 24.464790
99.9% KLD: 2.964850
99.0% KLD: 0.028737
95.0% KLD: 0.003968
90.0% KLD: 0.002280
Median KLD: 0.000562
10.0% KLD: 0.000007
5.0% KLD: 0.000001
1.0% KLD: -0.000001
0.1% KLD: -0.000006
Minimum KLD: -0.000057

====== Token probability statistics ======
Mean Δp: -0.017 ± 0.006 %
Maximum Δp: 99.818%
99.9% Δp: 15.451%
99.0% Δp: 3.027%
95.0% Δp: 1.402%
90.0% Δp: 0.821%
75.0% Δp: 0.152%
Median Δp: -0.000%
25.0% Δp: -0.179%
10.0% Δp: -0.885%
5.0% Δp: -1.477%
1.0% Δp: -3.127%
0.1% Δp: -13.658%
Minimum Δp: -99.648%
RMS Δp : 2.350 ± 0.085 %
Same top p: 97.771 ± 0.038 %

Qwen3.6-27B-UD-Q8_K_XL.gguf

35776484480 (33.31GiB) Qwen3.6-27B-UD-Q8_K_XL.gguf

====== Perplexity statistics ======
Mean PPL(Q) : 6.663686 ± 0.045346
Mean PPL(base) : 6.636486 ± 0.044736
Cor(ln(PPL(Q)), ln(PPL(base))): 99.54%
Mean ln(PPL(Q)/PPL(base)) : 0.004090 ± 0.000656
Mean PPL(Q)/PPL(base) : 1.004099 ± 0.000659
Mean PPL(Q)-PPL(base) : 0.027200 ± 0.004384

====== KL divergence statistics ======
Mean KLD: 0.012100 ± 0.000836
Maximum KLD: 24.382509
99.9% KLD: 2.473664
99.0% KLD: 0.024188
95.0% KLD: 0.005269
90.0% KLD: 0.003549
Median KLD: 0.000954
10.0% KLD: 0.000009
5.0% KLD: 0.000002
1.0% KLD: -0.000001
0.1% KLD: -0.000007
Minimum KLD: -0.000054

====== Token probability statistics ======
Mean Δp: -0.005 ± 0.006 %
Maximum Δp: 99.594%
99.9% Δp: 15.232%
99.0% Δp: 4.091%
95.0% Δp: 2.066%
90.0% Δp: 1.186%
75.0% Δp: 0.214%
Median Δp: -0.000%
25.0% Δp: -0.236%
10.0% Δp: -1.229%
5.0% Δp: -2.097%
1.0% Δp: -4.163%
0.1% Δp: -12.016%
Minimum Δp: -99.923%
RMS Δp : 2.340 ± 0.080 %
Same top p: 97.426 ± 0.041 %

Qwen3.6-27B-Q8-CC.gguf

32726111136 (30.47GiB) Qwen3.6-27B-Q8-CC.gguf

Note that PPL seems worse here but token probability and KL divergence seem better.

====== Perplexity statistics ======
Mean PPL(Q) : 6.681999 ± 0.045554
Mean PPL(base) : 6.636486 ± 0.044736
Cor(ln(PPL(Q)), ln(PPL(base))): 99.49%
Mean ln(PPL(Q)/PPL(base)) : 0.006835 ± 0.000688
Mean PPL(Q)/PPL(base) : 1.006858 ± 0.000693
Mean PPL(Q)-PPL(base) : 0.045513 ± 0.004626

====== KL divergence statistics ======
Mean KLD: 0.011324 ± 0.000790
Maximum KLD: 24.220026
99.9% KLD: 2.506243
99.0% KLD: 0.023331
95.0% KLD: 0.003847
90.0% KLD: 0.002324
Median KLD: 0.000499
10.0% KLD: 0.000004
5.0% KLD: 0.000001
1.0% KLD: -0.000001
0.1% KLD: -0.000010
Minimum KLD: -0.000112

====== Token probability statistics ======
Mean Δp: -0.027 ± 0.006 %
Maximum Δp: 99.801%
99.9% Δp: 13.591%
99.0% Δp: 3.079%
95.0% Δp: 1.560%
90.0% Δp: 0.686%
75.0% Δp: 0.077%
Median Δp: 0.000%
25.0% Δp: -0.084%
10.0% Δp: -0.770%
5.0% Δp: -1.682%
1.0% Δp: -3.208%
0.1% Δp: -16.596%
Minimum Δp: -99.918%
RMS Δp : 2.305 ± 0.084 %
Same top p: 98.358 ± 0.033 %
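One rough way to judge which of these differences stand out from the reported error bars is to divide each difference by the combined uncertainty. The sketch below does that for Q8-CC vs UD Q8 K XL using the numbers above. It treats the two error estimates as independent, which is an assumption that does not really hold here since both runs score the same text, so read the result as a ballpark figure only. The helper name `z_score` is mine.

```python
import math


def z_score(a, a_err, b, b_err):
    """Difference between two measurements in units of combined error."""
    return (a - b) / math.sqrt(a_err ** 2 + b_err ** 2)


# (Q8-CC value, error, UD Q8 K XL value, error), taken from the stats above.
same_top_z = z_score(98.358, 0.033, 97.426, 0.041)
mean_kld_z = z_score(0.011324, 0.000790, 0.012100, 0.000836)
ln_ppl_ratio_z = z_score(0.006835, 0.000688, 0.004090, 0.000656)

print(f"Same top p:        {same_top_z:+.1f} sigma")
print(f"Mean KLD:          {mean_kld_z:+.1f} sigma")
print(f"Mean ln PPL ratio: {ln_ppl_ratio_z:+.1f} sigma")
```

Under this naive view, the Same top p gain is far outside the error bars, the mean KLD gain is within them, and the PPL regression sits in between.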

For extra points, here's another quant that's still smaller than UD Q8 K XL and performs better on multiple metrics.

Qwen3.6-27B-Q8-CC-5.gguf

35144389536 (32.73GiB) Qwen3.6-27B-Q8-CC-5.gguf

====== Perplexity statistics ======
Mean PPL(Q) : 6.670677 ± 0.045414
Mean PPL(base) : 6.636486 ± 0.044736
Cor(ln(PPL(Q)), ln(PPL(base))): 99.59%
Mean ln(PPL(Q)/PPL(base)) : 0.005139 ± 0.000618
Mean PPL(Q)/PPL(base) : 1.005152 ± 0.000621
Mean PPL(Q)-PPL(base) : 0.034192 ± 0.004145

====== KL divergence statistics ======
Mean KLD: 0.010970 ± 0.000828
Maximum KLD: 25.486208
99.9% KLD: 1.975405
99.0% KLD: 0.021026
95.0% KLD: 0.003457
90.0% KLD: 0.002151
Median KLD: 0.000438
10.0% KLD: 0.000003
5.0% KLD: 0.000001
1.0% KLD: -0.000002
0.1% KLD: -0.000011
Minimum KLD: -0.000480

====== Token probability statistics ======
Mean Δp: -0.020 ± 0.006 %
Maximum Δp: 99.828%
99.9% Δp: 13.630%
99.0% Δp: 3.038%
95.0% Δp: 1.474%
90.0% Δp: 0.643%
75.0% Δp: 0.072%
Median Δp: 0.000%
25.0% Δp: -0.073%
10.0% Δp: -0.714%
5.0% Δp: -1.669%
1.0% Δp: -3.113%
0.1% Δp: -12.475%
Minimum Δp: -99.916%
RMS Δp : 2.201 ± 0.084 %
Same top p: 98.453 ± 0.032 %
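To compare several quants side by side, the statistics can be parsed into a dictionary instead of being copied by hand. This is a small sketch that assumes the output was saved with one statistic per line, in the `name: value ± error` shape shown in this post. The regular expression and the `parse_stats` name are my own.

```python
import re

# Matches e.g. "Mean KLD: 0.010970 ± 0.000828" or "Maximum Δp: 99.828%".
STAT_LINE = re.compile(
    r"^\s*(?P<name>.+?)\s*:\s*"
    r"(?P<value>-?\d+(?:\.\d+)?)\s*"
    r"(?:±\s*(?P<err>\d+(?:\.\d+)?))?\s*%?\s*$"
)


def parse_stats(text):
    """Map each statistic name to (value, error or None)."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:  # section headers and blank lines simply do not match
            err = match.group("err")
            stats[match.group("name")] = (
                float(match.group("value")),
                float(err) if err is not None else None,
            )
    return stats


sample = """\
====== KL divergence statistics ======
Mean KLD: 0.010970 ± 0.000828
99.9% KLD: 1.975405
====== Token probability statistics ======
RMS Δp : 2.201 ± 0.084 %
Same top p: 98.453 ± 0.032 %
"""
for name, (value, err) in parse_stats(sample).items():
    print(name, value, err)
```

Note that percent signs are dropped, so values such as Same top p come back as plain numbers in percent.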

And here's the recipe for CC-5:

/home/user/llm/llama.cpp/build/bin/llama-quantize \
  --token-embedding-type bf16 \
  --tensor-type output_norm=bf16 \
  --tensor-type attn_k=bf16 \
  --tensor-type post_attention_norm=bf16 \
  --tensor-type attn_q_norm=bf16 \
  --tensor-type attn_k_norm=bf16 \
  --tensor-type attn_norm=bf16 \
  --tensor-type ssm_a=bf16 \
  --tensor-type ssm_alpha=bf16 \
  --tensor-type ssm_beta=bf16 \
  --tensor-type ssm_conv1d=bf16 \
  --tensor-type ssm_dt.bias=bf16 \
  --tensor-type ssm_norm=bf16 \
  --tensor-type nextn.eh_proj=bf16 \
  --tensor-type blk.34.attn_gate=bf16 \
  --tensor-type blk.6.attn_gate=bf16 \
  --tensor-type blk.18.attn_gate=bf16 \
  --tensor-type blk.37.attn_gate=bf16 \
  --tensor-type blk.4.attn_gate=bf16 \
  --tensor-type blk.5.attn_gate=bf16 \
  --tensor-type blk.1.attn_gate=bf16 \
  --tensor-type blk.0.attn_gate=bf16 \
  --tensor-type blk.40.attn_gate=bf16 \
  --tensor-type blk.2.attn_gate=bf16 \
  --tensor-type blk.10.attn_gate=bf16 \
  --tensor-type blk.8.attn_gate=bf16 \
  --tensor-type blk.9.attn_gate=bf16 \
  --tensor-type blk.16.attn_gate=bf16 \
  --tensor-type blk.11.attn_q=bf16 \
  --tensor-type blk.63.attn_q=bf16 \
  --tensor-type blk.27.attn_q=bf16 \
  --tensor-type blk.43.attn_q=bf16 \
  --tensor-type blk.59.attn_q=bf16 \
  --tensor-type blk.47.attn_q=bf16 \
  --tensor-type blk.51.attn_q=bf16 \
  --tensor-type blk.3.attn_q=bf16 \
  --tensor-type blk.7.attn_q=bf16 \
  --tensor-type blk.35.attn_q=bf16 \
  --tensor-type blk.0.attn_qkv=bf16 \
  --tensor-type blk.37.attn_qkv=bf16 \
  --tensor-type blk.28.attn_qkv=bf16 \
  --tensor-type blk.40.attn_qkv=bf16 \
  --tensor-type blk.32.attn_qkv=bf16 \
  --tensor-type blk.36.attn_qkv=bf16 \
  --tensor-type blk.33.attn_qkv=bf16 \
  --tensor-type blk.34.attn_qkv=bf16 \
  --tensor-type blk.30.attn_qkv=bf16 \
  --tensor-type blk.63.attn_v=bf16 \
  --tensor-type blk.59.attn_v=bf16 \
  --tensor-type blk.51.attn_v=bf16 \
  --tensor-type blk.55.attn_v=bf16 \
  --tensor-type blk.35.attn_v=bf16 \
  --tensor-type blk.43.attn_v=bf16 \
  --tensor-type blk.19.attn_v=bf16 \
  --tensor-type blk.47.attn_v=bf16 \
  --tensor-type blk.27.attn_v=bf16 \
  --tensor-type blk.39.attn_v=bf16 \
  --tensor-type blk.37.ssm_out=bf16 \
  --tensor-type blk.0.ssm_out=bf16 \
  --tensor-type blk.34.ssm_out=bf16 \
  --tensor-type blk.2.ssm_out=bf16 \
  --tensor-type blk.18.ssm_out=bf16 \
  --tensor-type blk.6.ssm_out=bf16 \
  --tensor-type blk.21.ssm_out=bf16 \
  --tensor-type blk.1.ssm_out=bf16 \
  --tensor-type blk.30.ssm_out=bf16 \
  --tensor-type blk.26.ssm_out=bf16 \
  --tensor-type blk.4.ssm_out=bf16 \
  --tensor-type blk.10.ssm_out=bf16 \
  --tensor-type blk.5.ssm_out=bf16 \
  --tensor-type blk.14.ssm_out=bf16 \
  --tensor-type blk.25.ssm_out=bf16 \
  --tensor-type blk.12.ssm_out=bf16 \
  --tensor-type blk.8.ssm_out=bf16 \
  --tensor-type blk.28.ssm_out=bf16 \
  --tensor-type blk.9.ssm_out=bf16 \
  --tensor-type blk.63.ffn_up=bf16 \
  --tensor-type blk.62.ffn_up=bf16 \
  --tensor-type blk.61.ffn_up=bf16 \
  --tensor-type blk.22.ffn_up=bf16 \
  --tensor-type blk.63.ffn_gate=bf16 \
  --tensor-type blk.50.ffn_gate=bf16 \
  --tensor-type blk.49.ffn_gate=bf16 \
  --tensor-type blk.34.ffn_gate=bf16 \
  --tensor-type blk.61.ffn_gate=bf16 \
  --tensor-type blk.62.ffn_gate=bf16 \
  --tensor-type blk.6.ffn_down=bf16 \
  --tensor-type blk.64.ffn_down=bf16 \
  --tensor-type blk.22.ffn_down=bf16 \
  --tensor-type blk.18.ffn_down=bf16 \
  --tensor-type blk.63.ffn_down=bf16 \
  --tensor-type blk.0.ffn_down=bf16 \
  --tensor-type blk.1.ffn_down=bf16 \
  --tensor-type blk.62.ffn_down=bf16 \
  --output-tensor-type bf16 \
  /home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-BF16-00001-of-00002.gguf \
  /home/user/llm/models/Qwen3.6-27B/Qwen3.6-27B-Q8-CC-5.gguf \
  q8_0

Q8 K XL vs CC-5:

https://preview.redd.it/fkkmks72wa5h1.png?width=585&format=png&auto=webp&s=b37a2c2c75687e61c13753700f4b42dbf6d3282c

| Metric | Qwen3.6-27B-UD-Q8_K_XL | Qwen3.6-27B-Q8-CC-5 |
| --- | --- | --- |
| Mean KLD | 0.012100 ± 0.000836 | 0.010970 ± 0.000828 |
| Maximum KLD | 24.382509 | 25.486208 |
| 99.9% KLD | 2.473664 | 1.975405 |
| 99.0% KLD | 0.024188 | 0.021026 |
| 95.0% KLD | 0.005269 | 0.003457 |
| 90.0% KLD | 0.003549 | 0.002151 |
| Median KLD | 0.000954 | 0.000438 |
| 10.0% KLD | 0.000009 | 0.000003 |
| 5.0% KLD | 0.000002 | 0.000001 |
| 1.0% KLD | -0.000001 | -0.000002 |
| 0.1% KLD | -0.000007 | -0.000011 |
| Minimum KLD | -0.000054 | -0.000480 |

| Metric | Qwen3.6-27B-UD-Q8_K_XL | Qwen3.6-27B-Q8-CC-5 |
| --- | --- | --- |
| Mean Δp | -0.005% ± 0.006% | -0.020% ± 0.006% |
| Maximum Δp | 99.59% | 99.83% |
| 99.9% Δp | 15.23% | 13.63% |
| 99.0% Δp | 4.09% | 3.04% |
| 95.0% Δp | 2.07% | 1.47% |
| 90.0% Δp | 1.19% | 0.64% |
| 75.0% Δp | 0.21% | 0.07% |
| Median Δp | 0.00% | 0.00% |
| 25.0% Δp | -0.24% | -0.07% |
| 10.0% Δp | -1.23% | -0.71% |
| 5.0% Δp | -2.10% | -1.67% |
| 1.0% Δp | -4.16% | -3.11% |
| 0.1% Δp | -12.02% | -12.48% |
| Minimum Δp | -99.92% | -99.92% |
| RMS Δp | 2.340% ± 0.080% | 2.201% ± 0.084% |
| Same top p | 97.426% ± 0.041% | 98.453% ± 0.032% |

submitted by /u/fragment_me