r/LocalLLaMA

KV cache quant benchmarks: KVarN 6-bit matches q8_0, 4-bit matches q5_0. Massive!


TL;DR: Based on long-context KLD benchmarks, KVarN appears to be simply better than the usual llama.cpp KV cache quants. At every size, KVarN matches the precision of the usual quant one tier higher.

A number of people in the comments under my previous post asked a fair question: what if we drop the obsession with 2-bit and 3-bit toy quants and apply KVarN to the high end? So I did just that in my latest BeeLlama v0.3.2 Preview (in short, a fork of llama.cpp with DFlash) and ran the same benchmarks as before on basically all the KV cache quant pairs, which allows for a thorough analysis.

Note that the current v0.3.2 release binaries are stale while CI/CD is still running, so build it from source!

And it appears that the initial "punches one tier above its weight" principle holds up for 5-bit, 6-bit and 8-bit KVarN as well, which is honestly just great news! This means you can match q8_0 while only paying for 6-bit memory, or even 5.5-bit by going for the 6/5 combo with minimal losses. There's also good quality at just 4-bit, or with asymmetric 5/4-bit pairs. Massive for VRAM-constrained setups!
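
To make the memory arithmetic concrete, here is a rough Python sketch that converts cache types into bits per element and into a size relative to bf16. The byte counts for q4_0, q5_0, q5_1 and q8_0 are the mainline ggml block layouts (32 elements per block). The KVarN figures are not derived from the implementation: they are simply read back from the Size column of the results table in this post. The helper names are hypothetical, and the sketch assumes the K and V caches hold the same number of elements, so a pair costs the plain average of its two types.

```python
# Rough sketch: bits per cached element and size relative to bf16.
# Assumption: K and V caches hold equal element counts, so an asymmetric
# pair costs the plain average of its two types.

BF16_BITS = 16.0
BLOCK_ELEMS = 32  # elements per block in the ggml formats listed below

# Bytes per 32-element block for mainline ggml formats
# (scale/min header plus packed quants).
GGML_BLOCK_BYTES = {"q4_0": 18, "q5_0": 22, "q5_1": 24, "q8_0": 34}

# Size column of the symmetric KVarN rows in the results table (% of bf16).
# These are measured values from the post, not derived from the format.
KVARN_SIZE_PCT = {
    "kvarn8": 52.9, "kvarn6": 40.4, "kvarn5": 34.2,
    "kvarn4": 27.9, "kvarn3": 21.7, "kvarn2": 15.4,
}


def bits_per_element(cache_type: str) -> float:
    """Effective bits stored per cached element for one cache type."""
    if cache_type in GGML_BLOCK_BYTES:
        return GGML_BLOCK_BYTES[cache_type] * 8 / BLOCK_ELEMS
    return KVARN_SIZE_PCT[cache_type] / 100 * BF16_BITS


def pair_size_pct(first: str, second: str) -> float:
    """Size of a cache pair as a percentage of a bf16 cache."""
    avg_bits = (bits_per_element(first) + bits_per_element(second)) / 2
    return 100 * avg_bits / BF16_BITS


if __name__ == "__main__":
    for pair in [("q8_0", "q8_0"), ("kvarn6", "kvarn6"),
                 ("kvarn6", "kvarn5"), ("q5_0", "q5_0"), ("kvarn4", "kvarn4")]:
        print(pair, f"{pair_size_pct(*pair):.1f}% of bf16")
```

Under these assumptions q8_0 works out to 8.5 bits per element, while the table's kvarn6 row implies roughly 6.5 bits, which is where the "q8_0 quality for 6-bit memory" saving comes from.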

Prompt processing is slower for now, but I'm not claiming that's inevitable. The implementation is still very raw and can likely be optimized further.

KLD results on Qwen 3.6 27B Q5_K_S + 64k context

The rest of the benchmark data and an in-depth analysis are available in the article.

| Cache | Size (vs bf16) | Mean KLD | Mean precision | 99.9% KLD | 99.9% precision | Tok/s |
|---|---|---|---|---|---|---|
| bf16 | 100.0% | 0.000375 | 100.00% | 0.023258 | 100.00% | 850.81 |
| kvarn8-kvarn8 | 52.9% | 0.002361 | 99.80% | 0.076809 | 94.79% | 634.12 |
| q8_0 | 53.1% | 0.002328 | 99.80% | 0.078709 | 94.61% | 851.11 |
| kvarn8-kvarn6 | 46.7% | 0.002390 | 99.80% | 0.082415 | 94.26% | 643.46 |
| kvarn8-kvarn5 | 43.6% | 0.002266 | 99.81% | 0.084573 | 94.05% | 646.63 |
| kvarn6-kvarn6 | 40.4% | 0.002338 | 99.80% | 0.078797 | 94.60% | 689.31 |
| q8_0-q5_1 | 45.3% | 0.002529 | 99.78% | 0.082880 | 94.21% | 828.63 |
| kvarn8-kvarn4 | 40.4% | 0.002533 | 99.78% | 0.086218 | 93.90% | 645.67 |
| q8_0-q4_0 | 40.6% | 0.003316 | 99.71% | 0.104680 | 92.18% | 849.37 |
| q6_0 | 40.6% | 0.002614 | 99.78% | 0.090800 | 93.47% | 845.96 |
| kvarn6-kvarn5 | 37.3% | 0.002602 | 99.78% | 0.079818 | 94.50% | 692.77 |
| kvarn8-kvarn3 | 37.3% | 0.003529 | 99.69% | 0.121564 | 90.64% | 649.84 |
| kvarn5-kvarn5 | 34.2% | 0.002705 | 99.77% | 0.083457 | 94.16% | 699.80 |
| kvarn6-kvarn4 | 34.2% | 0.002831 | 99.75% | 0.091507 | 93.40% | 694.79 |
| kvarn8-kvarn2 | 34.2% | 0.009494 | 99.09% | 0.325652 | 73.90% | 651.45 |
| q6_0-q5_0 | 37.5% | 0.002820 | 99.76% | 0.092682 | 93.29% | 846.86 |
| q5_1 | 37.5% | 0.002911 | 99.75% | 0.098354 | 92.77% | 841.65 |
| q5_0 | 34.4% | 0.003206 | 99.72% | 0.099073 | 92.70% | 849.79 |
| kvarn5-kvarn4 | 31.1% | 0.002824 | 99.76% | 0.093313 | 93.23% | 700.73 |
| kvarn6-kvarn3 | 31.1% | 0.003533 | 99.68% | 0.123369 | 90.47% | 697.01 |
| q5_0-q4_0 | 31.3% | 0.003581 | 99.68% | 0.113332 | 91.39% | 847.64 |
| kvarn5-kvarn3 | 27.9% | 0.003515 | 99.69% | 0.118848 | 90.88% | 701.67 |
| kvarn6-kvarn2 | 27.9% | 0.009301 | 99.11% | 0.310819 | 75.01% | 697.56 |
| q4_0 | 28.1% | 0.004711 | 99.57% | 0.130419 | 89.84% | 855.08 |
| kvarn4-kvarn4 | 27.9% | 0.002974 | 99.74% | 0.094819 | 93.09% | 760.88 |
| kvarn5-kvarn2 | 24.8% | 0.009813 | 99.06% | 0.344122 | 72.55% | 705.26 |
| q5_0-turbo3_tcq | 27.3% | 0.005471 | 99.49% | 0.158514 | 87.35% | 815.80 |
| turbo4 | 25.8% | 0.004760 | 99.55% | 0.138370 | 89.13% | 705.32 |
| kvarn4-kvarn3 | 24.8% | 0.003824 | 99.66% | 0.135028 | 89.42% | 765.23 |
| kvarn3-kvarn4 | 24.8% | 0.004652 | 99.57% | 0.140358 | 88.95% | 770.52 |
| q4_0-turbo3_tcq | 24.2% | 0.006269 | 99.41% | 0.186572 | 84.93% | 821.89 |
| kvarn4-kvarn2 | 21.7% | 0.010449 | 99.00% | 0.340392 | 72.82% | 765.57 |
| kvarn3-kvarn3 | 21.7% | 0.005349 | 99.50% | 0.168135 | 86.51% | 773.12 |
| kvarn2-kvarn4 | 21.7% | 0.013639 | 98.68% | 0.418240 | 67.37% | 771.78 |
| turbo3_tcq | 20.3% | 0.007978 | 99.24% | 0.227104 | 81.56% | 795.20 |
| kvarn3-kvarn2 | 18.6% | 0.011122 | 98.93% | 0.345995 | 72.42% | 773.65 |
| kvarn2-kvarn3 | 18.6% | 0.014589 | 98.59% | 0.445014 | 65.59% | 773.83 |
| kvarn2-kvarn2 | 15.4% | 0.021395 | 97.92% | 0.630208 | 54.50% | 776.81 |
| turbo2_tcq | 14.1% | 0.023073 | 97.76% | 0.632401 | 54.38% | 807.25 |
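
For anyone wondering how the precision columns relate to the KLD columns: the post does not spell out a formula, but the numbers look consistent with exp(-(KLD - KLD of the bf16 row)), expressed as a percentage. That is an inference from the table, not a confirmed definition, and the function name below is hypothetical. A short Python sketch that checks it on three rows:

```python
import math

# Baseline values taken from the bf16 row of the table.
BF16_MEAN_KLD = 0.000375
BF16_P999_KLD = 0.023258


def precision_pct(kld: float, baseline_kld: float) -> float:
    """Assumed formula (inferred from the table): 100 * exp(-(kld - baseline))."""
    return 100.0 * math.exp(-(kld - baseline_kld))


# (mean KLD, 99.9% KLD) copied from three rows of the table.
ROWS = {
    "kvarn8-kvarn8": (0.002361, 0.076809),
    "q4_0": (0.004711, 0.130419),
    "kvarn2-kvarn2": (0.021395, 0.630208),
}

if __name__ == "__main__":
    for name, (mean_kld, p999_kld) in ROWS.items():
        mean_prec = precision_pct(mean_kld, BF16_MEAN_KLD)
        p999_prec = precision_pct(p999_kld, BF16_P999_KLD)
        print(f"{name}: mean {mean_prec:.2f}%, 99.9% {p999_prec:.2f}%")
```

If the inferred formula is right, this also explains why the bf16 row shows 100.00% precision despite a nonzero KLD: every row is measured relative to that baseline.
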
submitted by /u/Anbeeld