r/LocalLLaMA · 3 min read

KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Here's my article with 38 quant pairs benchmarked by KLD (KL divergence) across 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This lets us track not only how cache quantization affects precision in a vacuum, but also how it interacts with the noise from the model's own quantization.

All benchmarks were run on my BeeLlama.cpp fork, which let me include several quant types that are not present in mainline llama.cpp: vanilla TurboQuant, TCQ 3-bit/2-bit, and q6_0.

https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
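
For readers unfamiliar with the metric: KLD (Kullback-Leibler divergence) compares the next-token probability distribution of a test run against a reference run at each position, and lower means closer to the reference. Below is a minimal Python sketch of the per-token computation. The function names and the toy logits are illustrative assumptions, not code from the benchmark itself.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) for one token position, in nats.

    Toy version: assumes no probability underflows to exactly zero.
    """
    p = softmax(ref_logits)
    q = softmax(test_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kld(ref_rows, test_rows):
    """Average the per-position KLD over a sequence of positions."""
    values = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical logits over a 4-token vocabulary at two positions.
    reference = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, 0.0]]
    quantized = [[1.9, 1.1, 0.1, -1.0], [0.4, 0.6, 2.8, 0.1]]
    print(f"mean KLD: {mean_kld(reference, quantized):.6f}")
```

A mean over all positions summarizes typical drift, while a high percentile such as 99.9% captures the rare positions where the quantized cache diverges badly.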

TL;DR

  • q5_0 KV is underrated, and so is q5_1 as V cache. Neither gets the attention it deserves. The data shows they provide solid mid-range performance without being as heavy as q8_0 or as shitty as q4_0.
  • q8_0 / q4_* is overrated. Strong K does not fully rescue weak V: those pairs are too unbalanced and perform worse than their community reputation suggests.
  • Prefer sane KV quants over wasting VRAM on bf16 cache for heavily quantized weights. Weights and cache draw from the same VRAM pool, so a Q4/IQ4 model with full bf16 KV looks like the wrong trade to me; you might want to balance them better.
  • Practical ladder: q8_0 / q6_0 or q8_0 / q5_1 for high-end, q6_0 / q5_0 for extra headroom, q5_0 / q5_0 or q5_0 / q4_1 when VRAM is tight, q4_0 / q4_0 only if nothing else fits the desired context.
  • TurboQuant is confirmed to be useful only as extreme compression. turbo3_tcq is the only type with decent quality per size; turbo4 is basically useless while also being slow.
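
The Size column in the table below is consistent with simple bits-per-element arithmetic: average the K and V storage cost and divide by bf16's 16 bits. Here is a rough sketch that walks the ladder above. It assumes the standard ggml block layouts for q8_0, q5_1, q5_0, q4_1 and q4_0 (32 values plus scale metadata per block). The q6_0 figure is inferred from the table rather than read from the fork's source, and the function name is hypothetical.

```python
# Storage cost per cached element, in bits.
BITS_PER_ELEMENT = {
    "bf16": 16.0,
    "q8_0": 8.5,  # 34 bytes per 32-element block
    "q6_0": 6.5,  # ASSUMPTION: inferred from the 40.6% row in the table
    "q5_1": 6.0,  # 24 bytes per 32-element block
    "q5_0": 5.5,  # 22 bytes per 32-element block
    "q4_1": 5.0,  # 20 bytes per 32-element block
    "q4_0": 4.5,  # 18 bytes per 32-element block
}


def relative_cache_size(k_type, v_type):
    """KV cache size as a percentage of a full bf16 K + bf16 V cache."""
    used = BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type]
    full = 2 * BITS_PER_ELEMENT["bf16"]
    return 100.0 * used / full


if __name__ == "__main__":
    ladder = [
        ("q8_0", "q6_0"),
        ("q8_0", "q5_1"),
        ("q6_0", "q5_0"),
        ("q5_0", "q5_0"),
        ("q5_0", "q4_1"),
        ("q4_0", "q4_0"),
    ]
    for k_type, v_type in ladder:
        size = relative_cache_size(k_type, v_type)
        print(f"{k_type} / {v_type}: {size:.1f}% of bf16")
```

Because K and V contribute equally to the total, dropping V by one step saves exactly as much memory as dropping K by the same number of bits, which is why the quality difference between the two choices is what matters.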

KLD results on Q5_K_S + 64k context

The rest of the benchmark data and the in-depth analysis are available in the article.

| Cache | Size | Mean KLD | Mean precision | 99.9% KLD | 99.9% precision | Tok/s |
|---|---|---|---|---|---|---|
| bf16 | 100.0% | 0.000375 | 100.00% | 0.023258 | 100.00% | 850.81 |
| q8_0 | 53.1% | 0.002328 | 99.80% | 0.078709 | 94.61% | 851.11 |
| q8_0-q6_0 | 46.9% | 0.002499 | 99.79% | 0.081616 | 94.33% | 848.78 |
| q8_0-q5_1 | 45.3% | 0.002529 | 99.78% | 0.082880 | 94.21% | 828.63 |
| q8_0-q5_0 | 43.8% | 0.002656 | 99.77% | 0.088486 | 93.69% | 847.33 |
| q8_0-q4_1 | 42.2% | 0.003080 | 99.73% | 0.099080 | 92.70% | 786.54 |
| q8_0-q4_0 | 40.6% | 0.003316 | 99.71% | 0.104680 | 92.18% | 849.37 |
| q6_0 | 40.6% | 0.002614 | 99.78% | 0.090800 | 93.47% | 845.96 |
| q8_0-turbo4 | 39.5% | 0.003561 | 99.68% | 0.103041 | 92.33% | 838.90 |
| q6_0-q5_1 | 39.1% | 0.002781 | 99.76% | 0.090447 | 93.50% | 846.24 |
| q5_1 | 37.5% | 0.002911 | 99.75% | 0.098354 | 92.77% | 841.65 |
| q6_0-q5_0 | 37.5% | 0.002820 | 99.76% | 0.092682 | 93.29% | 846.86 |
| q8_0-turbo3_tcq | 36.7% | 0.005090 | 99.53% | 0.149387 | 88.15% | 817.57 |
| q6_0-q4_1 | 35.9% | 0.003312 | 99.71% | 0.104582 | 92.19% | 848.42 |
| q5_0 | 34.4% | 0.003206 | 99.72% | 0.099073 | 92.70% | 849.79 |
| q5_1-q4_1 | 34.4% | 0.003380 | 99.70% | 0.095011 | 93.08% | 846.27 |
| q6_0-q4_0 | 34.4% | 0.003288 | 99.71% | 0.111566 | 91.55% | 848.24 |
| q6_0-turbo4 | 33.2% | 0.003748 | 99.66% | 0.107377 | 91.93% | 837.77 |
| q5_0-q4_1 | 32.8% | 0.003471 | 99.69% | 0.099618 | 92.65% | 847.59 |
| q5_1-q4_0 | 32.8% | 0.003626 | 99.68% | 0.108649 | 91.82% | 846.91 |
| q4_1 | 31.3% | 0.004476 | 99.59% | 0.141813 | 88.82% | 854.33 |
| q5_0-q4_0 | 31.3% | 0.003581 | 99.68% | 0.113332 | 91.39% | 847.64 |
| q6_0-turbo3_tcq | 30.5% | 0.005379 | 99.50% | 0.154680 | 87.68% | 819.23 |
| q5_0-turbo4 | 30.1% | 0.003812 | 99.66% | 0.112249 | 91.49% | 837.52 |
| q5_1-turbo3_tcq | 28.9% | 0.005594 | 99.48% | 0.144591 | 88.57% | 816.05 |
| q4_0 | 28.1% | 0.004711 | 99.57% | 0.130419 | 89.84% | 855.08 |
| q5_0-turbo3_tcq | 27.3% | 0.005471 | 99.49% | 0.158514 | 87.35% | 815.80 |
| q5_0-turbo3 | 27.0% | 0.007097 | 99.33% | 0.192428 | 84.44% | 837.90 |
| q4_1-turbo3_tcq | 25.8% | 0.006184 | 99.42% | 0.174831 | 85.94% | 816.95 |
| turbo4 | 25.8% | 0.004760 | 99.55% | 0.138370 | 89.13% | 705.32 |
| q4_0-turbo3_tcq | 24.2% | 0.006269 | 99.41% | 0.186572 | 84.93% | 821.89 |
| q4_0-turbo3 | 23.8% | 0.008235 | 99.22% | 0.222154 | 81.96% | 839.29 |
| q4_0-turbo2_tcq | 21.1% | 0.015168 | 98.53% | 0.395244 | 68.94% | 826.07 |
| turbo3_tcq | 20.3% | 0.007978 | 99.24% | 0.227104 | 81.56% | 795.20 |
| turbo3 | 19.5% | 0.011181 | 98.93% | 0.296060 | 76.12% | 836.75 |
| turbo3_tcq-turbo2_tcq | 17.2% | 0.016386 | 98.41% | 0.437043 | 66.11% | 796.16 |
| turbo3-turbo2 | 16.4% | 0.023985 | 97.67% | 0.605087 | 55.89% | 831.88 |
| turbo2_tcq | 14.1% | 0.023073 | 97.76% | 0.632401 | 54.38% | 807.25 |
| turbo2 | 13.3% | 0.036230 | 96.48% | 0.903576 | 41.47% | 842.29 |
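
The two precision columns appear to be derived from the KLD columns as exp(-(KLD - KLD_bf16)), with the bf16 row treated as the 100% baseline. That formula is inferred from the numbers above, not stated in the post. The sketch below reproduces a few rows under that assumption, and the function name is hypothetical.

```python
import math

# Baseline values copied from the bf16 row of the table.
BF16_MEAN_KLD = 0.000375
BF16_P999_KLD = 0.023258


def precision_pct(kld, baseline_kld):
    """ASSUMED formula: precision = exp(-(KLD - baseline KLD)), as a percentage."""
    return 100.0 * math.exp(-(kld - baseline_kld))


if __name__ == "__main__":
    # (cache, mean KLD, 99.9% KLD) copied from the table.
    rows = [
        ("q8_0", 0.002328, 0.078709),
        ("q5_0", 0.003206, 0.099073),
        ("turbo2", 0.036230, 0.903576),
    ]
    for name, mean_kld, p999_kld in rows:
        mean_prec = precision_pct(mean_kld, BF16_MEAN_KLD)
        tail_prec = precision_pct(p999_kld, BF16_P999_KLD)
        print(f"{name}: mean {mean_prec:.2f}%, 99.9% {tail_prec:.2f}%")
```

Read this way, the 99.9% precision column is the more sensitive one: a small gap in mean KLD can hide a much larger gap at the worst positions.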
submitted by /u/Anbeeld