Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity
Authors: Xiuying Wei, Caglar Gulcehre
Published: 2026-05-27
arXiv: 2605.28640
Project page: https://huggingface.co/barpitf/ratplus
Code: https://github.com/wimh966/rat-plus
Abstract
The RAT+ memory module improves the accuracy of query-aware sparse inference methods for long-context language models across a range of sparse budgets.
Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using Quest, MoBA, and SnapKV as representative methods, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining for 10B tokens with the added memory module. Finally, we propose two hypotheses for why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
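To make "query-aware KV sparsity under a budget" concrete, here is a minimal sketch in plain Python. It is a generic token-level stand-in and not the selection rule of Quest, MoBA, or SnapKV; the function names (`sparse_attention`, `dense_attention`) and the unscaled dot-product scoring are assumptions made for illustration only.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(query, keys, values, budget):
    """Attend only to the `budget` cached positions that score highest
    for this query (hypothetical helper, illustration only)."""
    scores = [dot(query, k) for k in keys]
    # Query-aware selection: rank cached positions by their score for this query.
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:budget])  # restore positional order
    weights = softmax([scores[i] for i in kept])
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, kept))
        for d in range(dim)
    ]
    return kept, output


def dense_attention(query, keys, values):
    """Standard attention over every cached position, for comparison."""
    return sparse_attention(query, keys, values, budget=len(keys))[1]
```

With a budget equal to the cache length, the sparse path reduces to dense attention. Smaller budgets read fewer KV entries and drop whatever the selection step misses, which is the accuracy gap the abstract measures across sparse budgets.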
Community
Rather than focusing solely on improving downstream inference methods, we can also design upstream architectures that are inherently better suited to sparse inference. This paper builds on our previous RAT (NeurIPS 2025) and RAT+ (ICML 2026), where we augment attention with an additional recurrence to support flexible dilated patterns at inference. In this paper, we further show that such an architecture also benefits other inference-time sparsity methods!
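For readers unfamiliar with the term, an exponentially decaying memory can be sketched as a simple running state. This page does not give the actual RAT+ recurrence, so the fixed scalar `decay`, the update rule, and both function names below are assumptions chosen only to illustrate the general idea.

```python
def decaying_memory(inputs, decay=0.9):
    """Return the running states m_t = decay * m_{t-1} + (1 - decay) * x_t.

    `inputs` is a list of equal-length vectors; the state starts at zero.
    The fixed scalar decay is an illustrative assumption, not the RAT+ design.
    """
    state = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        state = [decay * s + (1.0 - decay) * xi for s, xi in zip(state, x)]
        states.append(state)
    return states


def contribution_weight(age, decay=0.9):
    """Weight that an input seen `age` steps ago carries in the current state."""
    return (1.0 - decay) * decay ** age
```

Under this rule an input's influence shrinks geometrically with its age, so the state summarizes recent context compactly while older tokens fade rather than vanish abruptly.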