r/LocalLLaMA · June 4, 2026 · 1 min read

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

The KV-cache quant race just got more interesting. Huawei just open-sourced KVarN, a KV-cache quantization method under Apache 2.0 that drops into vLLM with one flag. Posting because the tradeoff it claims is genuinely different from what's already in the stack, and I'd like to see it stress-tested.

The landscape it's stepping into

FP8 (--kv-cache-dtype fp8) is the current default choice for KV-cache quant: ~2x KV capacity, BF16-level throughput, near-zero quality loss. Hard to beat, and the bar anything new has to clear.
TurboQuant (Google) got the headlines this year for aggressive compression. It's the one that spooked memory-chip stocks back in March. But per vLLM's own study (Red Hat AI), it buys that memory by giving up speed: it runs at 66-80% of BF16 throughput, up to ~2.5x slower at burst, because it dequantizes back to BF16 for the attention compute. And its low-bit modes drop ~20 points on reasoning (AIME25, LiveCodeBench).
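
For anyone who wants to see where FP8's ~2x capacity figure comes from, here is a minimal sketch of the usual KV-cache sizing arithmetic. The model shape (32 layers, 8 KV heads, head dim 128) and the 16 GiB budget are illustrative assumptions, not numbers from the post, and the helper names are hypothetical. It ignores scale factors and allocator overhead.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: each layer stores one K vector and one V vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(budget_bytes, per_token_bytes):
    """How many tokens of KV cache fit in a fixed memory budget."""
    return budget_bytes // per_token_bytes


# Assumed GQA model shape and memory budget (illustrative only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 16 * 1024**3  # 16 GiB reserved for KV cache

bf16_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2)  # 2 bytes/elem
fp8_per_token = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 1)   # 1 byte/elem

bf16_tokens = tokens_that_fit(BUDGET, bf16_per_token)
fp8_tokens = tokens_that_fit(BUDGET, fp8_per_token)

print(f"BF16: {bf16_per_token} B/token, {bf16_tokens} tokens")
print(f"FP8:  {fp8_per_token} B/token, {fp8_tokens} tokens")
print(f"capacity gain: {fp8_tokens / bf16_tokens:.1f}x")
```

Halving bytes per element halves bytes per token, so the same budget holds twice the tokens. Any method claiming more than 2x has to go below 8 bits per element on average.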

What KVarN claims (vs FP16)

3-5x more context (vs FP8's ~2x)
up to ~1.4x FP16 throughput, at FP16-quality outputs
up to ~2.4x TurboQuant throughput, at higher accuracy
at matched accuracy, at least as compact as every TurboQuant operating point (their paper's table)
holds reasoning quality at high compression, the exact axis where TurboQuant's low-bit variants fall apart
no model changes, no retraining, no calibration; single vLLM flag
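
A quick way to cross-check the throughput claims against the TurboQuant numbers quoted earlier is to divide them out. This is a rough sketch, not a benchmark: it assumes FP16 and BF16 throughput are interchangeable baselines and that the "up to" figures can be compared directly, which may not hold if they come from different workloads. The function name is hypothetical.

```python
def implied_speedup(kvarn_vs_baseline, turbo_vs_baseline):
    """KVarN throughput relative to TurboQuant, given both relative to 16-bit."""
    return kvarn_vs_baseline / turbo_vs_baseline


KVARN_PEAK = 1.4             # "up to ~1.4x FP16 throughput" (claimed)
TURBO_STEADY = (0.66, 0.80)  # "66-80% of BF16 throughput" (vLLM study, as quoted)
TURBO_BURST = 1 / 2.5        # "up to ~2.5x slower at burst"
CLAIMED_VS_TURBO = 2.4       # "up to ~2.4x TurboQuant throughput" (claimed)

# Range implied by the steady-state TurboQuant figures.
low = implied_speedup(KVARN_PEAK, TURBO_STEADY[1])
high = implied_speedup(KVARN_PEAK, TURBO_STEADY[0])

# TurboQuant fraction of baseline needed for the 2.4x claim to hold at KVarN's peak.
turbo_needed = KVARN_PEAK / CLAIMED_VS_TURBO

print(f"implied by steady-state numbers: {low:.2f}x to {high:.2f}x")
print(f"2.4x needs TurboQuant at {turbo_needed:.2f}x of baseline")
print(f"TurboQuant at burst: {TURBO_BURST:.2f}x of baseline")
```

Under those assumptions, the steady-state numbers imply roughly 1.75x to 2.1x, and the 2.4x figure needs TurboQuant at about 0.58x of baseline, which sits between its quoted steady-state and burst numbers. So the claims are not obviously inconsistent, but the 2.4x headline probably reflects a less favorable TurboQuant workload than the 66-80% range.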

Reasoning benchmarks (from the paper)

https://preview.redd.it/aeyuff7h2a5h1.png?width=738&format=png&auto=webp&s=252a2948ed2e3dca280f967c6016b36e73f3858c

This is the part that matters. Most KV-cache quantization tanks either math/code accuracy or throughput; KVarN claims to tank neither.

Throughput with vLLM vs. compression (from repo README)

https://preview.redd.it/11lhlua73a5h1.png?width=1216&format=png&auto=webp&s=2b50ac0169708511cb3b29f84084fafeda94fed1

Links