r/LocalLLaMA · · 3 min read

I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising!


Saw this post here yesterday: KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)

Cheap KV cache with good precision? Sign me up! Oh, vLLM only...

Wait, I do have my own llama.cpp fork, and I do have an extensive reference for KLD benchmarking. I should act!

And so I acted. Until 6 am.

So now KVarN is implemented in the publicly available BeeLlama.cpp v0.3.2 Preview, and you can literally just try it yourself: download a prebuilt, launch it with --cache-type-k kvarn4 and --cache-type-v kvarn4 or whatever bits you want, and enjoy the ride. If it works on your platform, that is, because I only have an RTX 3090 for testing. Qwen 3.6 27B and Gemma 4 31B are supported for sure, and their little bros will probably work too.
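A minimal sketch of what that launch could look like, wrapped in Python so the pieces are easy to swap. The two cache flags and the kvarn4 value come from the post; the binary name `llama-server`, the model path, and the `-m` / `-c` arguments are assumptions carried over from upstream llama.cpp and may differ in the prebuilt you download.

```python
import shlex

def build_command(model_path, k_type="kvarn4", v_type="kvarn4",
                  ctx=65536, binary="llama-server"):
    """Build an argv list for a llama.cpp-style server.

    `binary` and `model_path` are hypothetical placeholders; only the
    --cache-type-k / --cache-type-v flags are taken from the post.
    """
    return [
        binary,
        "-m", model_path,          # path to a GGUF model (placeholder)
        "-c", str(ctx),            # context size in tokens
        "--cache-type-k", k_type,  # K cache quant type
        "--cache-type-v", v_type,  # V cache quant type
    ]

cmd = build_command("models/your-model.gguf")
# Print the command instead of running it, so it can be inspected first.
print(shlex.join(cmd))
```

Mixed pairs follow the same pattern, e.g. `build_command(path, "kvarn4", "kvarn3")`.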

And here comes the more important question: should you try it? The original paper says "we've got fp16 in k4v2". Yeah, sure... Maybe in some benchmarks... But how does it hold up in general?

To answer this question, I booted up the good old KLD and started comparing KVarN to my collection of 50-something quant pairs. As usual, we don't look at PPL and other pathetic metrics, we check median and 99.9% KLD over 3 different configs of Qwen 3.6 27B.
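For anyone who hasn't met KLD before: it measures how far the model's next-token distribution with a quantized cache drifts from the reference distribution, token by token, and the per-token values are then summarized. Below is a minimal pure-Python sketch of that idea. It is not the benchmarking code used for this post; the function names and the nearest-rank percentile are illustrative assumptions.

```python
import math
import statistics

def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def token_kld(ref_logits, test_logits):
    """KL(ref || test) in nats for a single token position."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def percentile(values, pct):
    """Nearest-rank percentile (an assumption; other definitions exist)."""
    s = sorted(values)
    rank = max(1, math.ceil(pct * len(s) / 100))
    return s[rank - 1]

def summarize(klds):
    """Summary statistics over per-token KLD values."""
    return {
        "mean": statistics.fmean(klds),
        "median": statistics.median(klds),
        "p99.9": percentile(klds, 99.9),
    }

# Toy example: reference logits vs. slightly perturbed logits.
ref = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
test = [[1.9, 1.1, 0.1], [0.4, 0.7, 2.8]]
print(summarize([token_kld(r, t) for r, t in zip(ref, test)]))
```

The 99.9th percentile is the interesting one: it captures the rare tokens where the quantized cache goes badly wrong, which an average smooths over.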

And it's not that bad. I mean, compared to the infamous TurboQuant. KVarN actually appears to be punching above its weight even compared to rotation-enabled llama.cpp quants. Not by much, but we VRAM-constrained folks are happy for every 0.1% of precision.

TL;DR is that it delivers q5 quality at 4-bit, and q4 quality at 3.5-bit. And that's on a very raw implementation. It can probably be improved further. Especially speed. For speed I'm not claiming anything at all, it really is just too raw to compare. But the mature implementation in the paper had it faster than the usual quants.
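To put "4-bit" and "3.5-bit" next to the Size column of the results table, here is a small arithmetic sketch. It assumes Size is the average of the K and V bits per value divided by the 16 bits of bf16, with K and V holding the same number of elements. The bits-per-value figures are the standard llama.cpp block sizes (for example q4_0 stores 32 values in 18 bytes, so 4.5 bits each); the helper names are made up for illustration.

```python
BF16_BITS = 16.0

# Bits per value for standard llama.cpp block formats
# (32 values per block plus per-block scale/offset metadata).
BITS_PER_VALUE = {"q8_0": 8.5, "q5_1": 6.0, "q5_0": 5.5, "q4_0": 4.5}

def cache_size_pct(k_type, v_type):
    """Cache size relative to bf16, assuming equally sized K and V caches."""
    avg_bits = (BITS_PER_VALUE[k_type] + BITS_PER_VALUE[v_type]) / 2
    return 100.0 * avg_bits / BF16_BITS

def effective_bits(size_pct):
    """Invert a Size percentage into average bits per cached value."""
    return size_pct / 100.0 * BF16_BITS

for pair in [("q8_0", "q8_0"), ("q8_0", "q4_0"), ("q5_0", "q4_0"), ("q4_0", "q4_0")]:
    print(pair, round(cache_size_pct(*pair), 1))

# Size values copied from the table for the KVarN pairs.
for name, size in [("kvarn4-kvarn4", 27.9), ("kvarn4-kvarn3", 24.8)]:
    print(name, round(effective_bits(size), 2), "bits per value")
```

Read this way, kvarn4-kvarn4 lands at roughly the same footprint as q4_0, which is why the comparison to q5-class KLD numbers is the relevant one.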

Is it fp16 quality? No. Is it still better than like anything else in the llama.cpp ecosystem? Looks like yes.

KLD results on Q5_K_S + 64k context

The rest of the benchmark data and the in-depth analysis are available in the article.

| Cache | Size | Mean KLD | Mean precision | 99.9% KLD | 99.9% precision | Tok/s |
|---|---|---|---|---|---|---|
| bf16 | 100.0% | 0.000375 | 100.00% | 0.023258 | 100.00% | 850.81 |
| q8_0 | 53.1% | 0.002328 | 99.80% | 0.078709 | 94.61% | 851.11 |
| q8_0-q5_1 | 45.3% | 0.002529 | 99.78% | 0.082880 | 94.21% | 828.63 |
| q8_0-q4_0 | 40.6% | 0.003316 | 99.71% | 0.104680 | 92.18% | 849.37 |
| q6_0 | 40.6% | 0.002614 | 99.78% | 0.090800 | 93.47% | 845.96 |
| q6_0-q5_0 | 37.5% | 0.002820 | 99.76% | 0.092682 | 93.29% | 846.86 |
| q5_1 | 37.5% | 0.002911 | 99.75% | 0.098354 | 92.77% | 841.65 |
| q5_0 | 34.4% | 0.003206 | 99.72% | 0.099073 | 92.70% | 849.79 |
| q5_0-q4_0 | 31.3% | 0.003581 | 99.68% | 0.113332 | 91.39% | 847.64 |
| q4_0 | 28.1% | 0.004711 | 99.57% | 0.130419 | 89.84% | 855.08 |
| kvarn4-kvarn4 | 27.9% | 0.002974 | 99.74% | 0.094819 | 93.09% | 760.88 |
| q5_0-turbo3_tcq | 27.3% | 0.005471 | 99.49% | 0.158514 | 87.35% | 815.80 |
| turbo4 | 25.8% | 0.004760 | 99.55% | 0.138370 | 89.13% | 705.32 |
| kvarn4-kvarn3 | 24.8% | 0.003824 | 99.66% | 0.135028 | 89.42% | 765.23 |
| q4_0-turbo3_tcq | 24.2% | 0.006269 | 99.41% | 0.186572 | 84.93% | 821.89 |
| kvarn4-kvarn2 | 21.7% | 0.010449 | 99.00% | 0.340392 | 72.82% | 765.57 |
| kvarn3-kvarn3 | 21.7% | 0.005349 | 99.50% | 0.168135 | 86.51% | 773.12 |
| turbo3_tcq | 20.3% | 0.007978 | 99.24% | 0.227104 | 81.56% | 795.20 |
| kvarn3-kvarn2 | 18.6% | 0.011122 | 98.93% | 0.345995 | 72.42% | 773.65 |
| kvarn2-kvarn2 | 15.4% | 0.021395 | 97.92% | 0.630208 | 54.50% | 776.81 |
| turbo2_tcq | 14.1% | 0.023073 | 97.76% | 0.632401 | 54.38% | 807.25 |
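The post doesn't define the two precision columns. The numbers appear consistent with exp(-(KLD - bf16 KLD)), i.e. the KLD excess over the bf16 row mapped back to a probability-like scale. That formula is a reconstruction from the table, not something stated by the author, so treat the sketch below as a sanity check only. All input values are copied from the table.

```python
import math

# bf16 baseline row from the table.
BF16_MEAN_KLD = 0.000375
BF16_P999_KLD = 0.023258

def precision_pct(kld, baseline_kld):
    """Assumed formula: 100 * exp(-(KLD - baseline KLD)).

    Inferred from the table values, not confirmed by the post.
    """
    return 100.0 * math.exp(-(kld - baseline_kld))

# (mean KLD, 99.9% KLD) for a few rows of the table.
ROWS = {
    "q8_0": (0.002328, 0.078709),
    "q4_0": (0.004711, 0.130419),
    "kvarn4-kvarn4": (0.002974, 0.094819),
    "kvarn2-kvarn2": (0.021395, 0.630208),
}

for name, (mean_kld, p999_kld) in ROWS.items():
    mean_prec = precision_pct(mean_kld, BF16_MEAN_KLD)
    p999_prec = precision_pct(p999_kld, BF16_P999_KLD)
    print(f"{name:14s} mean {mean_prec:.2f}%  99.9% {p999_prec:.2f}%")
```

If the reconstruction holds, the bf16 row is 100% by construction, so the precision columns show degradation relative to the unquantized cache rather than relative to a perfect model.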
submitted by /u/Anbeeld