I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising!
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Saw this post here yesterday: KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)
Cheap KV cache with good precision? Sign me up! Oh, vLLM only...
Wait, I do have my own llama.cpp fork, and I do have an extensive reference for KLD benchmarking. I should act!
And so I acted. Until 6 am.
So now KVarN is implemented in the publicly available BeeLlama.cpp v0.3.2 Preview, and you can literally just try it yourself: download a prebuilt, launch it with --cache-type-k kvarn4 and --cache-type-v kvarn4 (or whatever bit width you want), and enjoy the ride. That is, if it works on your platform, because I only have an RTX 3090 for testing. Qwen 3.6 27B and Gemma 4 31B are supported for sure, and their little bros will probably work too.
And here comes the more important question: should you try it? The original paper says "we've got fp16 in k4v2". Yeah, sure... Maybe in some benchmarks... But how does it hold up in general?
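For anyone scripting the launch, here is a minimal sketch of how the two cache flags fit into a command line. The `--cache-type-k` / `--cache-type-v` flags and the `kvarn4` value come from the post; the binary name `llama-server`, the `-m` model flag (both borrowed from upstream llama.cpp) and the `model.gguf` path are assumptions for illustration and may differ in the fork's prebuilt.

```python
import shlex


def kvarn_launch_args(model_path, k_type="kvarn4", v_type="kvarn4",
                      binary="llama-server"):
    """Build an argv list that selects KVarN for the K and V caches.

    `binary` and the `-m` flag are assumed to match upstream llama.cpp;
    only the two --cache-type-* flags are taken from the post.
    """
    return [
        binary,
        "-m", model_path,            # hypothetical model path
        "--cache-type-k", k_type,    # K cache quant type
        "--cache-type-v", v_type,    # V cache quant type
    ]


args = kvarn_launch_args("model.gguf")
print(shlex.join(args))
```

The list can be passed straight to `subprocess.run(args)`; swapping `v_type` for `kvarn3` or `kvarn2` gives the mixed pairs that appear in the benchmark table.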
To answer this question, I booted up the good old KLD and started comparing KVarN to my collection of 50-something quant pairs. As usual, we don't look at PPL and other pathetic metrics, we check median and 99.9% KLD over 3 different configs of Qwen 3.6 27B.
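For readers new to the metric, a rough sketch of what a per-token KLD measurement boils down to: compare the next-token distribution of a reference run against the run with the quantized cache, then summarize across all tokens. This is not the benchmarking code used for the post; the function names are hypothetical, and the nearest-rank percentile is just one common convention.

```python
import math
import statistics


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def percentile(values, pct):
    """Nearest-rank percentile (an assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]


def summarize(klds):
    """Aggregate per-token KLD values into headline numbers."""
    return {
        "mean": sum(klds) / len(klds),
        "median": statistics.median(klds),
        "p99_9": percentile(klds, 99.9),
    }
```

The 99.9% figure is the interesting one: it captures the rare tokens where the quantized cache pushes the distribution far off the reference, which an average alone tends to hide.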
And it's not that bad. I mean, compared to the infamous TurboQuant. KVarN actually appears to be punching above its weight even compared to rotation-enabled llama.cpp quants. Not by much, but we VRAM-constrained folks are happy for every 0.1% of precision.
TL;DR is that it delivers q5 quality at 4-bit, and q4 quality at 3.5-bit. And that's on a very raw implementation. It can probably be improved further. Especially speed. For speed I'm not claiming anything at all, it really is just too raw to compare. But the mature implementation in the paper had it faster than the usual quants.
Is it fp16 quality? No. Is it still better than pretty much anything else in the llama.cpp ecosystem? Looks like yes.
KLD results on Q5_K_S + 64k context
The rest of the benchmark data and the in-depth analysis are available in the article.
| Cache | Size | Mean KLD | Mean precision | 99.9% KLD | 99.9% precision | Tok/s |
|---|---|---|---|---|---|---|
| bf16 | 100.0% | 0.000375 | 100.00% | 0.023258 | 100.00% | 850.81 |
| q8_0 | 53.1% | 0.002328 | 99.80% | 0.078709 | 94.61% | 851.11 |
| q8_0-q5_1 | 45.3% | 0.002529 | 99.78% | 0.082880 | 94.21% | 828.63 |
| q8_0-q4_0 | 40.6% | 0.003316 | 99.71% | 0.104680 | 92.18% | 849.37 |
| q6_0 | 40.6% | 0.002614 | 99.78% | 0.090800 | 93.47% | 845.96 |
| q6_0-q5_0 | 37.5% | 0.002820 | 99.76% | 0.092682 | 93.29% | 846.86 |
| q5_1 | 37.5% | 0.002911 | 99.75% | 0.098354 | 92.77% | 841.65 |
| q5_0 | 34.4% | 0.003206 | 99.72% | 0.099073 | 92.70% | 849.79 |
| q5_0-q4_0 | 31.3% | 0.003581 | 99.68% | 0.113332 | 91.39% | 847.64 |
| q4_0 | 28.1% | 0.004711 | 99.57% | 0.130419 | 89.84% | 855.08 |
| kvarn4-kvarn4 | 27.9% | 0.002974 | 99.74% | 0.094819 | 93.09% | 760.88 |
| q5_0-turbo3_tcq | 27.3% | 0.005471 | 99.49% | 0.158514 | 87.35% | 815.80 |
| turbo4 | 25.8% | 0.004760 | 99.55% | 0.138370 | 89.13% | 705.32 |
| kvarn4-kvarn3 | 24.8% | 0.003824 | 99.66% | 0.135028 | 89.42% | 765.23 |
| q4_0-turbo3_tcq | 24.2% | 0.006269 | 99.41% | 0.186572 | 84.93% | 821.89 |
| kvarn4-kvarn2 | 21.7% | 0.010449 | 99.00% | 0.340392 | 72.82% | 765.57 |
| kvarn3-kvarn3 | 21.7% | 0.005349 | 99.50% | 0.168135 | 86.51% | 773.12 |
| turbo3_tcq | 20.3% | 0.007978 | 99.24% | 0.227104 | 81.56% | 795.20 |
| kvarn3-kvarn2 | 18.6% | 0.011122 | 98.93% | 0.345995 | 72.42% | 773.65 |
| kvarn2-kvarn2 | 15.4% | 0.021395 | 97.92% | 0.630208 | 54.50% | 776.81 |
| turbo2_tcq | 14.1% | 0.023073 | 97.76% | 0.632401 | 54.38% | 807.25 |
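The post does not spell out how the precision columns are derived, but the numbers in the table are consistent with taking `exp(-(KLD - KLD_bf16))`, i.e. treating the bf16 row as the 100% baseline. The sketch below checks that reading against a few rows; the formula is inferred from the table, not confirmed by the author, and `precision_pct` is a hypothetical helper name.

```python
import math

# bf16 baseline values copied from the table above
BF16_MEAN_KLD = 0.000375
BF16_TAIL_KLD = 0.023258


def precision_pct(kld, baseline_kld):
    """Inferred mapping: precision = exp(-(KLD - baseline KLD)), as a percent."""
    return 100.0 * math.exp(-(kld - baseline_kld))


# (mean KLD, 99.9% KLD) for a few rows of the table
rows = {
    "q8_0": (0.002328, 0.078709),
    "q4_0": (0.004711, 0.130419),
    "kvarn4-kvarn4": (0.002974, 0.094819),
    "kvarn2-kvarn2": (0.021395, 0.630208),
}

for name, (mean_kld, tail_kld) in rows.items():
    mean_p = precision_pct(mean_kld, BF16_MEAN_KLD)
    tail_p = precision_pct(tail_kld, BF16_TAIL_KLD)
    print(f"{name}: mean {mean_p:.2f}%, 99.9% {tail_p:.2f}%")
```

Under this reading, a precision gap between two rows corresponds directly to a difference in KLD measured in nats, which is why small gaps in the mean column can sit next to much larger gaps in the 99.9% column.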