Hugging Face Daily Papers · 3 min read

Augmenting Attention with Exponentially Decaying Memory Improves Query-Aware KV Sparsity

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.28640

Published on May 27, 2026 · Submitted by XiuyingWei on Jun 8, 2026
Authors: Xiuying Wei, Caglar Gulcehre

Project page: https://huggingface.co/barpitf/ratplus
Code: https://github.com/wimh966/rat-plus

Abstract

RAT+ memory module enhances query-aware sparse inference methods by improving accuracy in long-context language models across various sparse budgets.

Efficient inference is critical for long-context language models, where attention computation and KV-cache access dominate the cost. Recent work, RAT+, introduces a recurrence-augmented attention backbone that enables flexible dilated attention at inference time. In this paper, we investigate whether this exponentially decaying memory can also improve existing query-aware sparse inference methods. Using representative methods including Quest, MoBA, and SnapKV, we show that RAT+ consistently improves accuracy over standard attention across sparse budgets on eight needle-in-a-haystack tasks. We validate these gains both on the released checkpoints from the RAT+ paper and on OLMo2-7B, which we continue pretraining with the added memory module for 10B tokens. Finally, we propose two hypotheses explaining why this memory module benefits query-aware sparse inference and design targeted experiments to support them.
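For readers unfamiliar with query-aware KV sparsity, the following is a minimal sketch of the general idea, not the implementation of Quest, MoBA, SnapKV, or RAT+. It assumes the KV cache is split into fixed-size blocks, scores each block against the current query with a cheap upper bound built from per-dimension key minima and maxima, and then runs softmax attention over only the top-scoring blocks. The function names, the `block_size` and `budget` parameters, and the min/max scoring rule are illustrative choices made here.

```python
import math
from typing import List, Tuple

Vector = List[float]


def block_upper_bounds(query: Vector, keys: List[Vector], block_size: int) -> List[float]:
    """Upper-bound the best query-key dot product inside each block of keys."""
    bounds = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        bound = 0.0
        for d, q in enumerate(query):
            lo = min(k[d] for k in block)
            hi = max(k[d] for k in block)
            # Per dimension, q * k[d] is largest at one end of [lo, hi].
            bound += max(q * lo, q * hi)
        bounds.append(bound)
    return bounds


def select_blocks(query: Vector, keys: List[Vector], block_size: int, budget: int) -> List[int]:
    """Return the indices of the `budget` blocks with the highest bounds."""
    bounds = block_upper_bounds(query, keys, block_size)
    ranked = sorted(range(len(bounds)), key=lambda b: bounds[b], reverse=True)
    return sorted(ranked[:budget])


def sparse_attention(
    query: Vector,
    keys: List[Vector],
    values: List[Vector],
    block_size: int,
    budget: int,
) -> Tuple[Vector, List[int]]:
    """Softmax attention restricted to the tokens of the selected blocks."""
    chosen = select_blocks(query, keys, block_size, budget)
    token_ids = [
        i
        for b in chosen
        for i in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in token_ids]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
        for d in range(dim)
    ]
    return output, chosen
```

The `budget` argument plays the role of the sparse budget mentioned in the abstract: a smaller budget reads fewer KV blocks per query, at the risk of skipping a block that actually matters.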

Community

Paper submitter about 7 hours ago

Rather than focusing solely on improving downstream inference methods, we can also design upstream architectures that are inherently more capable under sparse inference. This paper builds on our previous RAT (NeurIPS 2025) and RAT+ (ICML 2026), where we augment attention with an additional recurrence to support flexible dilated patterns at inference. In this paper, we further show that such an architecture boosts other inference-time sparsity methods as well!
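As a rough illustration of what "exponentially decaying memory" means, here is a generic sketch and not the RAT+ module itself. It assumes a single fixed scalar `decay` and the update m_t = decay * m_{t-1} + (1 - decay) * x_t; the actual recurrence in RAT+ may differ, for example by using learned, input-dependent gates. Both function names are hypothetical.

```python
from typing import List

Vector = List[float]


def decaying_memory(inputs: List[Vector], decay: float) -> List[Vector]:
    """Run m_t = decay * m_{t-1} + (1 - decay) * x_t and return every state."""
    memory = [0.0] * len(inputs[0])
    states = []
    for x in inputs:
        memory = [decay * m + (1.0 - decay) * xi for m, xi in zip(memory, x)]
        states.append(memory)
    return states


def contribution_weights(num_steps: int, decay: float) -> List[float]:
    """Weight of each past input in the final state: (1 - decay) * decay**age."""
    return [
        (1.0 - decay) * decay ** (num_steps - 1 - step)
        for step in range(num_steps)
    ]
```

Unrolling the recurrence shows that an input seen `age` steps ago contributes with weight (1 - decay) * decay**age, so older tokens fade geometrically while recent ones dominate the memory.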



