KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks
Authors: Lorenz K. Muller, Philippe Bich, Chiara Boretti, Hyun-Min Chang, Jiawei Zhuang, Lukas Cavigelli (HUAWEI Computing Systems Lab)
Published: 2026-06-02 · arXiv: 2606.03458
Abstract
KVarN is a calibration-free KV-cache quantizer that uses Hadamard rotation and dual-scaling variance normalization to reduce error accumulation during autoregressive decoding in large language models.
Test-time scaling is a powerful approach to obtain better reasoning in large language models, but it becomes memory-bottlenecked during long-horizon decoding as the KV-cache grows. KV-cache quantization can alleviate this bottleneck, but current methods are evaluated under prefill-like settings, and errors behave differently under autoregressive decoding. We show that in the latter regime, quantization errors accumulate across timesteps, driven primarily by incorrect token scales. We introduce KVarN, a calibration-free KV-cache quantizer that applies a Hadamard rotation followed by a dual-scaling variance normalization across both axes of the K and V matrices. We find that this combination fixes outlying token-scale errors and substantially reduces error accumulation relative to existing baselines. KVarN establishes a new state-of-the-art for KV-cache quantization on generative benchmarks, including MATH500, AIME24 and HumanEval, at 2-bit precision. A vLLM implementation of the KVarN method is available at https://github.com/huawei-csl/KVarN
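As a rough illustration of the pipeline the abstract describes (a Hadamard rotation, then scaling along both the token and channel axes, then low-bit quantization), here is a minimal pure-Python sketch. It is not the KVarN implementation: the alternating RMS normalization, the clipping range, the uniform 4-level quantizer, and every function name below are assumptions chosen for illustration. The authors' actual method is in the linked repository.

```python
import math

def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(vec)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [x * scale for x in out]

def rms(values, eps=1e-8):
    """Root mean square, with a small eps to avoid division by zero."""
    return math.sqrt(sum(v * v for v in values) / len(values)) + eps

def dual_scale(matrix, iters=2):
    """ASSUMPTION: alternately divide rows (tokens) and columns (channels)
    by their RMS. Returns the normalized matrix and the accumulated scales,
    so that matrix[i][j] == x[i][j] * row_s[i] * col_s[j]."""
    rows, cols = len(matrix), len(matrix[0])
    x = [list(r) for r in matrix]
    row_s, col_s = [1.0] * rows, [1.0] * cols
    for _ in range(iters):
        for i in range(rows):
            s = rms(x[i])
            row_s[i] *= s
            x[i] = [v / s for v in x[i]]
        for j in range(cols):
            s = rms([x[i][j] for i in range(rows)])
            col_s[j] *= s
            for i in range(rows):
                x[i][j] /= s
    return x, row_s, col_s

def quantize_2bit(x, clip=1.5):
    """ASSUMPTION: uniform 4-level quantizer on [-clip, clip]; codes are 0..3."""
    step = 2 * clip / 3
    return [[min(3, max(0, round((v + clip) / step))) for v in row] for row in x]

def dequantize(codes, row_s, col_s, clip=1.5):
    """Map codes back to values and reapply the token and channel scales."""
    step = 2 * clip / 3
    return [[(c * step - clip) * row_s[i] * col_s[j] for j, c in enumerate(row)]
            for i, row in enumerate(codes)]

if __name__ == "__main__":
    # Toy "K" matrix: 4 tokens x 8 channels, with token 2 on a much larger scale.
    K = [[math.sin(8 * i + j) * (10.0 if i == 2 else 1.0) for j in range(8)]
         for i in range(4)]
    rotated = [hadamard_rotate(row) for row in K]
    x, row_s, col_s = dual_scale(rotated)
    codes = quantize_2bit(x)
    K_hat = [hadamard_rotate(row) for row in dequantize(codes, row_s, col_s)]
    err = max(abs(a - b) for ra, rb in zip(K, K_hat) for a, b in zip(ra, rb))
    print("max abs reconstruction error:", err)
```

The per-token and per-channel scales are kept in full precision here and only the normalized entries are stored as 2-bit codes; how scales are stored and updated during decoding is a design choice this sketch does not attempt to reproduce.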
Community
Philippe Bich (pbicho): Hi! KVarN is finally here! Happy to chat about our paper :)
Urro (urroxyz): Very cool and useful.