KV cache quant benchmarks: KVarN 6-bit matches q8_0, 4-bit matches q5_0. Massive!
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
TL;DR: Based on long-context KLD benchmarks, KVarN appears to be simply better than the usual llama.cpp KV cache quants. At every size, KVarN matches the precision of a standard quant roughly one bit larger.
A number of people in the comments under my previous post asked a fair question: what if we drop the obsession with 2-bit and 3-bit toy quants and apply KVarN to the high end? So I did just that in my latest BeeLlama v0.3.2 Preview (in short, a fork of llama.cpp with DFlash) and ran the same benchmarks as before across basically all the KV cache quant pairs, which allows for a thorough analysis.
Note that the current v0.3.2 release binaries are stale while CI/CD is still running, so build it from source!
And it appears that the initial "punch one tier above its weight" principle holds up for 5-bit, 6-bit and 8-bit KVarN as well, which is honestly just great news! This means you can match q8_0 while only paying for 6-bit memory, or even 5.5-bit by going for a 6/5 combo with minimal losses. There's also good quality at just 4-bit, or with an asymmetric 5/4-bit pair. Massive for VRAM-constrained setups!
Prompt processing is slower for now, but I'm not claiming that this is inevitable yet. The implementation is still very raw and can likely be optimized further.
KLD results on Qwen 3.6 27B Q5_K_S + 64k context
The rest of the benchmark data and the in-depth analysis are available in the article.
| Cache | Size (vs bf16) | Mean KLD | Mean precision | 99.9% KLD | 99.9% precision | Tok/s |
|---|---|---|---|---|---|---|
| bf16 | 100.0% | 0.000375 | 100.00% | 0.023258 | 100.00% | 850.81 |
| kvarn8-kvarn8 | 52.9% | 0.002361 | 99.80% | 0.076809 | 94.79% | 634.12 |
| q8_0 | 53.1% | 0.002328 | 99.80% | 0.078709 | 94.61% | 851.11 |
| kvarn8-kvarn6 | 46.7% | 0.002390 | 99.80% | 0.082415 | 94.26% | 643.46 |
| kvarn8-kvarn5 | 43.6% | 0.002266 | 99.81% | 0.084573 | 94.05% | 646.63 |
| kvarn6-kvarn6 | 40.4% | 0.002338 | 99.80% | 0.078797 | 94.60% | 689.31 |
| q8_0-q5_1 | 45.3% | 0.002529 | 99.78% | 0.082880 | 94.21% | 828.63 |
| kvarn8-kvarn4 | 40.4% | 0.002533 | 99.78% | 0.086218 | 93.90% | 645.67 |
| q8_0-q4_0 | 40.6% | 0.003316 | 99.71% | 0.104680 | 92.18% | 849.37 |
| q6_0 | 40.6% | 0.002614 | 99.78% | 0.090800 | 93.47% | 845.96 |
| kvarn6-kvarn5 | 37.3% | 0.002602 | 99.78% | 0.079818 | 94.50% | 692.77 |
| kvarn8-kvarn3 | 37.3% | 0.003529 | 99.69% | 0.121564 | 90.64% | 649.84 |
| kvarn5-kvarn5 | 34.2% | 0.002705 | 99.77% | 0.083457 | 94.16% | 699.80 |
| kvarn6-kvarn4 | 34.2% | 0.002831 | 99.75% | 0.091507 | 93.40% | 694.79 |
| kvarn8-kvarn2 | 34.2% | 0.009494 | 99.09% | 0.325652 | 73.90% | 651.45 |
| q6_0-q5_0 | 37.5% | 0.002820 | 99.76% | 0.092682 | 93.29% | 846.86 |
| q5_1 | 37.5% | 0.002911 | 99.75% | 0.098354 | 92.77% | 841.65 |
| q5_0 | 34.4% | 0.003206 | 99.72% | 0.099073 | 92.70% | 849.79 |
| kvarn5-kvarn4 | 31.1% | 0.002824 | 99.76% | 0.093313 | 93.23% | 700.73 |
| kvarn6-kvarn3 | 31.1% | 0.003533 | 99.68% | 0.123369 | 90.47% | 697.01 |
| q5_0-q4_0 | 31.3% | 0.003581 | 99.68% | 0.113332 | 91.39% | 847.64 |
| kvarn5-kvarn3 | 27.9% | 0.003515 | 99.69% | 0.118848 | 90.88% | 701.67 |
| kvarn6-kvarn2 | 27.9% | 0.009301 | 99.11% | 0.310819 | 75.01% | 697.56 |
| q4_0 | 28.1% | 0.004711 | 99.57% | 0.130419 | 89.84% | 855.08 |
| kvarn4-kvarn4 | 27.9% | 0.002974 | 99.74% | 0.094819 | 93.09% | 760.88 |
| kvarn5-kvarn2 | 24.8% | 0.009813 | 99.06% | 0.344122 | 72.55% | 705.26 |
| q5_0-turbo3_tcq | 27.3% | 0.005471 | 99.49% | 0.158514 | 87.35% | 815.80 |
| turbo4 | 25.8% | 0.004760 | 99.55% | 0.138370 | 89.13% | 705.32 |
| kvarn4-kvarn3 | 24.8% | 0.003824 | 99.66% | 0.135028 | 89.42% | 765.23 |
| kvarn3-kvarn4 | 24.8% | 0.004652 | 99.57% | 0.140358 | 88.95% | 770.52 |
| q4_0-turbo3_tcq | 24.2% | 0.006269 | 99.41% | 0.186572 | 84.93% | 821.89 |
| kvarn4-kvarn2 | 21.7% | 0.010449 | 99.00% | 0.340392 | 72.82% | 765.57 |
| kvarn3-kvarn3 | 21.7% | 0.005349 | 99.50% | 0.168135 | 86.51% | 773.12 |
| kvarn2-kvarn4 | 21.7% | 0.013639 | 98.68% | 0.418240 | 67.37% | 771.78 |
| turbo3_tcq | 20.3% | 0.007978 | 99.24% | 0.227104 | 81.56% | 795.20 |
| kvarn3-kvarn2 | 18.6% | 0.011122 | 98.93% | 0.345995 | 72.42% | 773.65 |
| kvarn2-kvarn3 | 18.6% | 0.014589 | 98.59% | 0.445014 | 65.59% | 773.83 |
| kvarn2-kvarn2 | 15.4% | 0.021395 | 97.92% | 0.630208 | 54.50% | 776.81 |
| turbo2_tcq | 14.1% | 0.023073 | 97.76% | 0.632401 | 54.38% | 807.25 |
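
For anyone wondering how the precision columns relate to the KLD columns: the numbers in the table are consistent with `precision = exp(-(KLD - KLD_bf16))`, i.e. the excess KLD over the bf16 row turned into a probability-like score. That formula is inferred from the table values, not taken from the benchmark code, so treat this as a sketch; `precision_pct` is a hypothetical helper name.

```python
import math

# bf16 baseline row from the table (mean KLD, 99.9th percentile KLD).
BF16_MEAN_KLD = 0.000375
BF16_P999_KLD = 0.023258


def precision_pct(kld: float, baseline_kld: float) -> float:
    """Precision in percent, assuming precision = exp(-(KLD - baseline KLD))."""
    return 100 * math.exp(-(kld - baseline_kld))


# A few rows copied from the table: (mean KLD, 99.9% KLD).
rows = {
    "kvarn8-kvarn8": (0.002361, 0.076809),
    "q8_0": (0.002328, 0.078709),
    "q4_0": (0.004711, 0.130419),
    "kvarn2-kvarn2": (0.021395, 0.630208),
}

for name, (mean_kld, p999_kld) in rows.items():
    mean_prec = precision_pct(mean_kld, BF16_MEAN_KLD)
    p999_prec = precision_pct(p999_kld, BF16_P999_KLD)
    print(f"{name}: mean {mean_prec:.2f}%, 99.9% {p999_prec:.2f}%")
```

Under that assumption, the sampled rows reproduce the table's precision values to two decimals (for example 99.80% and 94.79% for kvarn8-kvarn8), which is why the bf16 row sits at exactly 100% despite its nonzero KLD.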