KV cache quant benchmarks: q5 & q6 are underrated, q8/q4 is bad, TCQ has a niche
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Here's my article with 38 quant pairs thoroughly benchmarked in KLD across 3 different Qwen 3.6 27B configs: Q5_K_S + 64k context, IQ4_XS + 64k context, IQ4_XS + 128k context. This lets us track not only how cache quantization affects precision in a vacuum, but also how it interacts with noise from the model weights themselves.
All benchmarks were done using my BeeLlama.cpp fork, which made it possible to include a number of quant types that are not present in mainline llama.cpp: vanilla TurboQuant, TCQ 3-bit/2-bit, and q6_0.
https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context
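For readers who have not worked with KLD before, here is a minimal sketch of how a mean KLD and a 99.9% tail KLD can be computed from per-token logits. It is an illustration only, not the fork's benchmark code: the function names are hypothetical, the nearest-rank percentile is an assumption, and the reference run that the benchmark compares against is not specified in this post.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kld(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(p) * (p - q) for p, q in zip(ref_lp, test_lp))

def summarize(klds, tail=0.999):
    """Return (mean KLD, tail-percentile KLD) using a nearest-rank percentile."""
    ordered = sorted(klds)
    rank = max(1, math.ceil(tail * len(ordered)))
    return sum(ordered) / len(ordered), ordered[rank - 1]

# Toy data: three token positions with a 4-entry "vocabulary".
reference = [[2.0, 1.0, 0.5, -1.0], [0.1, 0.2, 3.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
quantized = [[1.9, 1.1, 0.5, -1.0], [0.1, 0.3, 2.8, 0.0], [1.0, 1.0, 1.0, 1.0]]

per_token = [token_kld(r, t) for r, t in zip(reference, quantized)]
mean_kld, tail_kld = summarize(per_token)
print(f"mean KLD = {mean_kld:.6f}, 99.9% KLD = {tail_kld:.6f}")
```

The mean captures average drift from the reference, while the tail percentile captures the rare tokens where the quantized cache diverges badly, which is why the table reports both.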
TL;DR
- `q5_0` KV is underrated, and the same goes for `q5_1` as V cache. Both really don't get the attention they deserve. Data shows they provide solid mid-range performance without being as heavy as `q8_0` or as shitty as `q4_0`.
- `q8_0 / q4_*` is overrated. Strong K does not fully rescue weak V, and those pairs are too unbalanced and perform worse than the community reputation suggests.
- Prefer sane KV quants over wasting VRAM on `bf16` cache for heavily quantized weights. A `Q4`/`IQ4` model with full `bf16` KV looks like the wrong trade to me, and both draw from the same VRAM pool, so you might want to balance them better.
- Practical ladder: `q8_0 / q6_0` or `q8_0 / q5_1` for high-end, `q6_0 / q5_0` for extra headroom, `q5_0 / q5_0` or `q5_0 / q4_1` when VRAM is tight, `q4_0 / q4_0` only if nothing else fits the desired context.
- TurboQuant is confirmed to be useful only as extreme compression. `turbo3_tcq` is the only type with decent quality per size; `turbo4` is basically useless while also being slow.
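To put the ladder in VRAM terms, here is a rough sketch of the relative cache size of each rung versus a full bf16 cache. The bits-per-element values for q4/q5/q8 follow from the ggml block layouts (32 elements plus a per-block scale and, for the `_1` types, an offset). The 6.5 bits for q6_0 is an assumption that happens to match the 40.6% reported for it in the results table. The helper name is hypothetical, and the calculation assumes K and V hold the same number of elements.

```python
# Bits per cached element, including per-block scale/offset overhead.
# q6_0 = 6.5 is an assumption (fork-only type), consistent with the table's 40.6%.
BITS = {
    "bf16": 16.0,
    "q8_0": 8.5,
    "q6_0": 6.5,
    "q5_1": 6.0,
    "q5_0": 5.5,
    "q4_1": 5.0,
    "q4_0": 4.5,
}

def relative_size(k_type, v_type):
    """Cache size as a percentage of bf16/bf16, assuming equal K and V element counts."""
    return 100.0 * (BITS[k_type] + BITS[v_type]) / (2 * BITS["bf16"])

# The ladder from the TL;DR, ordered from most to least VRAM.
LADDER = [
    ("q8_0", "q6_0"),
    ("q8_0", "q5_1"),
    ("q6_0", "q5_0"),
    ("q5_0", "q5_0"),
    ("q5_0", "q4_1"),
    ("q4_0", "q4_0"),
]

for k, v in LADDER:
    print(f"{k} / {v}: {relative_size(k, v):.1f}% of bf16 cache")
```

Multiplying these percentages by the bf16 cache size for your context length gives an estimate of how much VRAM each rung frees up for weights or extra context.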
KLD results on Q5_K_S + 64k context
The rest of the benchmark data and the in-depth analysis are available in the article.
| Cache | Size | Mean KLD | Mean precision | 99.9% KLD | 99.9% precision | Tok/s |
|---|---|---|---|---|---|---|
| bf16 | 100.0% | 0.000375 | 100.00% | 0.023258 | 100.00% | 850.81 |
| q8_0 | 53.1% | 0.002328 | 99.80% | 0.078709 | 94.61% | 851.11 |
| q8_0-q6_0 | 46.9% | 0.002499 | 99.79% | 0.081616 | 94.33% | 848.78 |
| q8_0-q5_1 | 45.3% | 0.002529 | 99.78% | 0.082880 | 94.21% | 828.63 |
| q8_0-q5_0 | 43.8% | 0.002656 | 99.77% | 0.088486 | 93.69% | 847.33 |
| q8_0-q4_1 | 42.2% | 0.003080 | 99.73% | 0.099080 | 92.70% | 786.54 |
| q8_0-q4_0 | 40.6% | 0.003316 | 99.71% | 0.104680 | 92.18% | 849.37 |
| q6_0 | 40.6% | 0.002614 | 99.78% | 0.090800 | 93.47% | 845.96 |
| q8_0-turbo4 | 39.5% | 0.003561 | 99.68% | 0.103041 | 92.33% | 838.90 |
| q6_0-q5_1 | 39.1% | 0.002781 | 99.76% | 0.090447 | 93.50% | 846.24 |
| q5_1 | 37.5% | 0.002911 | 99.75% | 0.098354 | 92.77% | 841.65 |
| q6_0-q5_0 | 37.5% | 0.002820 | 99.76% | 0.092682 | 93.29% | 846.86 |
| q8_0-turbo3_tcq | 36.7% | 0.005090 | 99.53% | 0.149387 | 88.15% | 817.57 |
| q6_0-q4_1 | 35.9% | 0.003312 | 99.71% | 0.104582 | 92.19% | 848.42 |
| q5_0 | 34.4% | 0.003206 | 99.72% | 0.099073 | 92.70% | 849.79 |
| q5_1-q4_1 | 34.4% | 0.003380 | 99.70% | 0.095011 | 93.08% | 846.27 |
| q6_0-q4_0 | 34.4% | 0.003288 | 99.71% | 0.111566 | 91.55% | 848.24 |
| q6_0-turbo4 | 33.2% | 0.003748 | 99.66% | 0.107377 | 91.93% | 837.77 |
| q5_0-q4_1 | 32.8% | 0.003471 | 99.69% | 0.099618 | 92.65% | 847.59 |
| q5_1-q4_0 | 32.8% | 0.003626 | 99.68% | 0.108649 | 91.82% | 846.91 |
| q4_1 | 31.3% | 0.004476 | 99.59% | 0.141813 | 88.82% | 854.33 |
| q5_0-q4_0 | 31.3% | 0.003581 | 99.68% | 0.113332 | 91.39% | 847.64 |
| q6_0-turbo3_tcq | 30.5% | 0.005379 | 99.50% | 0.154680 | 87.68% | 819.23 |
| q5_0-turbo4 | 30.1% | 0.003812 | 99.66% | 0.112249 | 91.49% | 837.52 |
| q5_1-turbo3_tcq | 28.9% | 0.005594 | 99.48% | 0.144591 | 88.57% | 816.05 |
| q4_0 | 28.1% | 0.004711 | 99.57% | 0.130419 | 89.84% | 855.08 |
| q5_0-turbo3_tcq | 27.3% | 0.005471 | 99.49% | 0.158514 | 87.35% | 815.80 |
| q5_0-turbo3 | 27.0% | 0.007097 | 99.33% | 0.192428 | 84.44% | 837.90 |
| q4_1-turbo3_tcq | 25.8% | 0.006184 | 99.42% | 0.174831 | 85.94% | 816.95 |
| turbo4 | 25.8% | 0.004760 | 99.55% | 0.138370 | 89.13% | 705.32 |
| q4_0-turbo3_tcq | 24.2% | 0.006269 | 99.41% | 0.186572 | 84.93% | 821.89 |
| q4_0-turbo3 | 23.8% | 0.008235 | 99.22% | 0.222154 | 81.96% | 839.29 |
| q4_0-turbo2_tcq | 21.1% | 0.015168 | 98.53% | 0.395244 | 68.94% | 826.07 |
| turbo3_tcq | 20.3% | 0.007978 | 99.24% | 0.227104 | 81.56% | 795.20 |
| turbo3 | 19.5% | 0.011181 | 98.93% | 0.296060 | 76.12% | 836.75 |
| turbo3_tcq-turbo2_tcq | 17.2% | 0.016386 | 98.41% | 0.437043 | 66.11% | 796.16 |
| turbo3-turbo2 | 16.4% | 0.023985 | 97.67% | 0.605087 | 55.89% | 831.88 |
| turbo2_tcq | 14.1% | 0.023073 | 97.76% | 0.632401 | 54.38% | 807.25 |
| turbo2 | 13.3% | 0.036230 | 96.48% | 0.903576 | 41.47% | 842.29 |
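A note on reading the precision columns: the post does not define them, but the numbers in the table appear consistent with `exp(-(KLD - KLD_bf16))`, i.e. the bf16 row is treated as the 100% baseline. The sketch below checks that reading against a few rows. Treat the formula as an inference from the table rather than the article's definition; the function name is hypothetical.

```python
import math

# Baseline values copied from the bf16 row of the table.
BF16_MEAN_KLD = 0.000375
BF16_TAIL_KLD = 0.023258

def precision_pct(kld, baseline_kld):
    """Precision in percent, assuming precision = exp(-(KLD - baseline KLD))."""
    return 100.0 * math.exp(-(kld - baseline_kld))

# (cache, mean KLD, 99.9% KLD) copied from the table.
ROWS = [
    ("q8_0", 0.002328, 0.078709),
    ("q5_0", 0.003206, 0.099073),
    ("q4_0", 0.004711, 0.130419),
    ("turbo3", 0.011181, 0.296060),
]

for name, mean_kld, tail_kld in ROWS:
    mean_p = precision_pct(mean_kld, BF16_MEAN_KLD)
    tail_p = precision_pct(tail_kld, BF16_TAIL_KLD)
    print(f"{name}: mean precision {mean_p:.2f}%, 99.9% precision {tail_p:.2f}%")
```

Under this reading, the 99.9% precision column is the more sensitive one: small differences in mean KLD translate into multi-point gaps in the tail.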