KVarN: new KV-cache quant from Huawei. 3–5× KV cache compression with actual speed-up instead of slow-down, and unlike TurboQuant it holds up on reasoning (Apache 2.0, vLLM single flag)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
| The KV-cache quant race just got more interesting. Huawei just open-sourced KVarN, a KV-cache quantization method under Apache 2.0, drops into vLLM with one flag. Posting because the tradeoff it's claiming is genuinely different from what's already in the stack, and I'd like to see it stress-tested. The landscape it's stepping into
What KVarN claims (vs FP16)
Reasoning benchmarks (from the paper) This is the part that matters. Most KV-cache quant tanks either math/code accuracy or throughput; KVarN claims neither. Throughput with vLLM v. Compression (from repo readme) Links
It looks like they learned from the SINQ https://www.reddit.com/r/LocalLLaMA/comments/1nxjh4c/github_huaweicslsinq_welcome_to_the_official/ case where everyone was asking for throughput numbers and vLLM integration 😃 [link] [comments] |
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.