r/LocalLLaMA · · 1 min read

KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)

The KV-cache quant race just got more interesting. Huawei just open-sourced KVarN, a KV-cache quantization method under Apache 2.0 that drops into vLLM with one flag. Posting because the tradeoff it's claiming is genuinely different from what's already in the stack, and I'd like to see it stress-tested.

The landscape it's stepping into

  • FP8 (--kv-cache-dtype fp8) is the current go-to option (not vLLM's out-of-the-box setting, which keeps the model dtype): ~2x KV capacity, BF16-level throughput, near-zero quality loss. Hard to beat, and the bar anything new has to clear.
  • TurboQuant (Google) got the headlines this year for aggressive compression. It's the one that spooked memory-chip stocks back in March. But per vLLM's own study (Red Hat AI), it buys that memory by giving up speed: it runs at 66-80% of BF16 throughput, up to ~2.5x slower at burst, because it dequantizes back to BF16 for the attention compute. And its low-bit modes drop ~20 points on reasoning (AIME25, LiveCodeBench).
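
For anyone who wants to reproduce the baselines before trying KVarN, here is a minimal sketch that builds the two vLLM launch commands to compare (unquantized KV cache vs FP8). The model name is a placeholder I made up, and the KVarN flag is deliberately left out because this post doesn't spell it out; check the repo readme for that one.

```python
import shlex

# Values vLLM accepts for --kv-cache-dtype that matter for this comparison:
# "auto" keeps the model dtype (BF16/FP16), "fp8" enables the FP8 KV cache.
BASELINE_KV_DTYPES = ("auto", "fp8")


def baseline_command(model: str, kv_dtype: str) -> str:
    """Return a `vllm serve` command line for one baseline KV-cache dtype."""
    if kv_dtype not in BASELINE_KV_DTYPES:
        raise ValueError(f"not a baseline dtype in this sketch: {kv_dtype!r}")
    argv = ["vllm", "serve", model, "--kv-cache-dtype", kv_dtype]
    return shlex.join(argv)


if __name__ == "__main__":
    # "my-org/my-model" is a hypothetical placeholder, not a real checkpoint.
    for dtype in BASELINE_KV_DTYPES:
        print(baseline_command("my-org/my-model", dtype))
```

Running both with the same prompts and the same load generator gives you the BF16 and FP8 reference points that any new method has to beat.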

What KVarN claims (vs FP16)

  • 3-5x more context (vs FP8's ~2x)
  • up to ~1.4x FP16 throughput, at FP16-quality outputs
  • up to ~2.4x TurboQuant throughput, at higher accuracy
  • at matched accuracy, at least as compact as every TurboQuant operating point (their paper's table)
  • holds reasoning quality at high compression, which is exactly the axis where TurboQuant's low-bit variants fall apart
  • no model changes, no retraining, no calibration; single vLLM flag
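
To put the "3-5x more context" claim in concrete terms, here is a back-of-the-envelope sketch of how a KV compression ratio turns into tokens that fit in a fixed memory budget. The model shape (32 layers, 8 KV heads, head dim 128) and the 16 GiB budget are assumptions picked for round numbers, not figures from the paper, and the math ignores quantization metadata such as scales, so real gains will land a bit lower.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: float = 2.0) -> float:
    """Bytes of KV cache one token occupies (keys plus values, all layers)."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(budget_bytes: float, per_token_bytes: float,
                    compression: float = 1.0) -> int:
    """Tokens that fit in the budget at a given compression ratio vs 16-bit."""
    return int(budget_bytes * compression // per_token_bytes)


if __name__ == "__main__":
    # Hypothetical GQA model shape and memory budget (assumptions).
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    budget = 16 * 2**30  # 16 GiB set aside for the KV cache

    for label, ratio in [("FP16/BF16", 1.0), ("FP8", 2.0),
                         ("KVarN claim, low end", 3.0),
                         ("KVarN claim, high end", 5.0)]:
        print(f"{label:>22}: {tokens_that_fit(budget, per_token, ratio):,} tokens")
```

With these assumed numbers each token costs 128 KiB at 16-bit, so the budget holds 131,072 tokens uncompressed, and the claimed 3-5x range would stretch that to roughly 393k-655k tokens if the ratios hold end to end.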

Reasoning benchmarks (from the paper)

https://preview.redd.it/aeyuff7h2a5h1.png?width=738&format=png&auto=webp&s=252a2948ed2e3dca280f967c6016b36e73f3858c

This is the part that matters. Most KV-cache quantization methods tank either math/code accuracy or throughput; KVarN claims to avoid both.

Throughput with vLLM vs. compression (from repo readme)

https://preview.redd.it/11lhlua73a5h1.png?width=1216&format=png&auto=webp&s=2b50ac0169708511cb3b29f84084fafeda94fed1

Links

It looks like they learned from the SINQ case (https://www.reddit.com/r/LocalLLaMA/comments/1nxjh4c/github_huaweicslsinq_welcome_to_the_official/), where everyone was asking for throughput numbers and vLLM integration 😃

submitted by /u/acluk90