r/LocalLLaMA · June 5, 2026 · 3 min read

I implemented KVarN in my llama.cpp fork and ran KLD benchmarks. It's promising!



Saw this post here yesterday: "KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)".

Cheap KV cache with good precision? Sign me up! Oh, vLLM only...

Wait, I do have my own llama.cpp fork, and I do have an extensive reference for KLD benchmarking. I should act!

And so I acted. Until 6 am.

So now KVarN is implemented in the publicly available BeeLlama.cpp v0.3.2 Preview, and you can literally just try it yourself: download a prebuilt, launch it with --cache-type-k kvarn4 and --cache-type-v kvarn4 or whatever bits you want, and enjoy the ride. If it works on your platform, that is, because I only have an RTX 3090 for testing. Qwen 3.6 27B and Gemma 4 31B are supported for sure, and their little bros will probably work too.
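
For anyone scripting their launches, here is a minimal sketch of assembling that command line. Only the two `--cache-type-*` flags and the `kvarn4` value come from this post. The `llama-server` binary name and the `-m` model flag are taken from upstream llama.cpp and assumed unchanged in the fork, and the model path is a placeholder. Your setup may need extra flags that are not shown here.

```python
import shlex


def build_launch_command(model_path, k_type="kvarn4", v_type="kvarn4",
                         binary="llama-server"):
    """Build an argv list that selects the KV cache types for K and V.

    Assumptions: `binary` and `-m` follow upstream llama.cpp naming;
    `model_path` is whatever GGUF file you actually have.
    """
    return [
        binary,
        "-m", model_path,
        "--cache-type-k", k_type,
        "--cache-type-v", v_type,
    ]


# Hypothetical model path, shown only to illustrate the resulting command.
cmd = build_launch_command("models/your-model.gguf")
print(shlex.join(cmd))
```

Pass the list to `subprocess.run(cmd)` to actually start the server. Mixed setups such as 4-bit K with 3-bit V would be `build_launch_command(path, "kvarn4", "kvarn3")`.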

And here comes the more important question: should you try it? The original paper says "we've got fp16 in k4v2". Yeah, sure... Maybe in some benchmarks... But how does it hold up in general?

To answer this question, I booted up the good old KLD and started comparing KVarN to my collection of 50-something quant pairs. As usual, we don't look at PPL and other pathetic metrics, we check median and 99.9% KLD over 3 different configs of Qwen 3.6 27B.
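
In case KLD is new to you, here is a rough sketch of the idea in plain Python: for every token position, compare the next-token distribution of a reference run against the run with a quantized cache, then summarize the per-token values. The function names and the linear-interpolation percentile are my illustration choices and not necessarily what the benchmark tooling does internally.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kld(ref_logits, test_logits):
    """KL(P_ref || P_test) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def summarize(klds):
    """Mean, median and 99.9th percentile of per-token KLD values."""
    return {
        "mean": sum(klds) / len(klds),
        "median": percentile(klds, 50),
        "p99.9": percentile(klds, 99.9),
    }
```

The 99.9% figure is the interesting one for a cache quant: it shows how bad the worst tokens get, which an average happily hides.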

And it's not that bad. I mean, compared to the infamous TurboQuant. KVarN actually appears to be punching above its weight even compared to rotation-enabled llama.cpp quants. Not by much, but we VRAM-constrained folks are happy for every 0.1% of precision.

TL;DR is that it delivers q5 quality at 4-bit, and q4 quality at 3.5-bit. And that's on a very raw implementation. It can probably be improved further. Especially speed. For speed I'm not claiming anything at all, it really is just too raw to compare. But the mature implementation in the paper had it faster than the usual quants.
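
To make the TL;DR concrete, here is a small sketch that compares a few rows copied from the Q5_K_S + 64k table further down this post. The `compare` helper is just an illustration I made up for this; the numbers are the ones from the table.

```python
# name -> (cache size as % of bf16, mean KLD, 99.9% KLD), copied from the table
ROWS = {
    "q5_1": (37.5, 0.002911, 0.098354),
    "q5_0": (34.4, 0.003206, 0.099073),
    "q4_0": (28.1, 0.004711, 0.130419),
    "kvarn4-kvarn4": (27.9, 0.002974, 0.094819),
    "kvarn4-kvarn3": (24.8, 0.003824, 0.135028),
}


def compare(candidate, baseline):
    """Report whether `candidate` beats `baseline` on size, mean KLD and tail KLD."""
    c_size, c_mean, c_tail = ROWS[candidate]
    b_size, b_mean, b_tail = ROWS[baseline]
    return {
        "smaller": c_size < b_size,
        "better_mean": c_mean < b_mean,
        "better_tail": c_tail < b_tail,
        "size_saving_pct": (1 - c_size / b_size) * 100,
    }


# 4-bit K and V against the classic 5-bit cache
print(compare("kvarn4-kvarn4", "q5_0"))
# 4-bit K with 3-bit V (3.5 bits on average) against the classic 4-bit cache
print(compare("kvarn4-kvarn3", "q4_0"))
```

On this config, kvarn4-kvarn4 is smaller than q5_0 and ahead on both mean and 99.9% KLD, while kvarn4-kvarn3 is smaller than q4_0 and ahead on mean KLD but slightly behind on the 99.9% tail.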

Is it fp16 quality? No. Is it still better than pretty much anything else in the llama.cpp ecosystem? Looks like yes.

KLD results on Q5_K_S + 64k context

The rest of the benchmark data and the in-depth analysis are available in the article.

Cache	Size	Mean KLD	Mean precision	99.9% KLD	99.9% precision	Tok/s
bf16	100.0%	0.000375	100.00%	0.023258	100.00%	850.81
q8_0	53.1%	0.002328	99.80%	0.078709	94.61%	851.11
q8_0-q5_1	45.3%	0.002529	99.78%	0.082880	94.21%	828.63
q8_0-q4_0	40.6%	0.003316	99.71%	0.104680	92.18%	849.37
q6_0	40.6%	0.002614	99.78%	0.090800	93.47%	845.96
q6_0-q5_0	37.5%	0.002820	99.76%	0.092682	93.29%	846.86
q5_1	37.5%	0.002911	99.75%	0.098354	92.77%	841.65
q5_0	34.4%	0.003206	99.72%	0.099073	92.70%	849.79
q5_0-q4_0	31.3%	0.003581	99.68%	0.113332	91.39%	847.64
q4_0	28.1%	0.004711	99.57%	0.130419	89.84%	855.08
kvarn4-kvarn4	27.9%	0.002974	99.74%	0.094819	93.09%	760.88
q5_0-turbo3_tcq	27.3%	0.005471	99.49%	0.158514	87.35%	815.80
turbo4	25.8%	0.004760	99.55%	0.138370	89.13%	705.32
kvarn4-kvarn3	24.8%	0.003824	99.66%	0.135028	89.42%	765.23
q4_0-turbo3_tcq	24.2%	0.006269	99.41%	0.186572	84.93%	821.89
kvarn4-kvarn2	21.7%	0.010449	99.00%	0.340392	72.82%	765.57
kvarn3-kvarn3	21.7%	0.005349	99.50%	0.168135	86.51%	773.12
turbo3_tcq	20.3%	0.007978	99.24%	0.227104	81.56%	795.20
kvarn3-kvarn2	18.6%	0.011122	98.93%	0.345995	72.42%	773.65
kvarn2-kvarn2	15.4%	0.021395	97.92%	0.630208	54.50%	776.81
turbo2_tcq	14.1%	0.023073	97.76%	0.632401	54.38%	807.25
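
A note on reading the precision columns. The post itself does not spell out the formula, but the table values are consistent with exp(-(KLD - KLD of the bf16 row)), which is why bf16 sits at exactly 100%. Treat the sketch below as my inference from the numbers, not as the definition used by the benchmark.

```python
import math

# bf16 row of the table above: mean KLD and 99.9% KLD
BF16_MEAN = 0.000375
BF16_P999 = 0.023258


def precision_pct(kld, baseline_kld):
    """Assumed mapping from KLD to 'precision': exp(-(kld - baseline)) * 100."""
    return math.exp(-(kld - baseline_kld)) * 100


# kvarn4-kvarn4 row: the table lists 99.74% and 93.09%
print(round(precision_pct(0.002974, BF16_MEAN), 2))
print(round(precision_pct(0.094819, BF16_P999), 2))
```

Under this reading, a precision gap between two cache types is really a gap in KLD on top of the bf16 floor, so small percentage differences in the tail column correspond to noticeably worse worst-case tokens.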

submitted by /u/Anbeeld