Hugging Face Daily Papers · May 22, 2026 · 4 min read

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Gated DeltaNet-2 introduces a linear attention architecture that improves memory management by decoupling erase and write operations, achieving superior performance on long-context benchmarks compared to existing recurrent models.

arxiv:2605.22791

Published on May 21 · Submitted by taesiri on May 22

NVIDIA

Authors: Ali Hatamizadeh, Yejin Choi, Jan Kautz

AI-generated summary

Gated DeltaNet-2 improves upon existing linear attention models by separating erase and write operations through distinct channel-wise gates, achieving superior performance in long-context language modeling and retrieval tasks.

Abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
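The abstract describes the decoupled update only in words, so the following is a minimal sketch under an assumed form, not the authors' implementation. It assumes a d_k x d_v state S, a decay vector a and an erase gate b over key channels, and a write gate w over value channels, with the step S_t = Diag(a) S - (b * k) r^T + k (w * v)^T, where r is the read of the decayed state at key k. The function names (read, decay, gdn2_step, kda_step) and the exact gate placement are hypothetical; the official code lives in the repository linked above.

```python
def read(S, k):
    """Return S^T k, the value the memory currently associates with key k."""
    return [sum(S[i][j] * k[i] for i in range(len(k))) for j in range(len(S[0]))]


def decay(S, a):
    """Scale row i of S by a[i] (channel-wise decay over key channels)."""
    return [[a[i] * x for x in row] for i, row in enumerate(S)]


def gdn2_step(S, k, v, a, b, w):
    """One step of the ASSUMED decoupled rule on a d_k x d_v state S.

    a: decay per key channel, b: erase gate per key channel,
    w: write gate per value channel (hypothetical names and shapes).
    """
    S = decay(S, a)
    r = read(S, k)  # what the decayed memory currently returns for k
    return [
        [S[i][j] - b[i] * k[i] * r[j]   # erase: remove the old read, gated by b
                 + k[i] * w[j] * v[j]   # write: commit the new value, gated by w
         for j in range(len(v))]
        for i in range(len(k))
    ]


def kda_step(S, k, v, a, beta):
    """Scalar-gated delta rule with channel-wise decay (KDA-style baseline)."""
    S = decay(S, a)
    r = read(S, k)
    return [[S[i][j] + beta * k[i] * (v[j] - r[j]) for j in range(len(v))]
            for i in range(len(k))]


# Toy usage: d_k = 2, d_v = 3, unit-norm key, no decay.
k = [1.0, 0.0]
v_old, v_new = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
no_decay, full_k, full_v = [1.0, 1.0], [1.0, 1.0], [1.0, 1.0, 1.0]
S = gdn2_step([[0.0] * 3 for _ in range(2)], k, v_old, no_decay, full_k, full_v)

overwritten = gdn2_step(S, k, v_new, no_decay, full_k, full_v)
erase_only = gdn2_step(S, k, v_new, no_decay, full_k, [0.0] * 3)
write_only = gdn2_step(S, k, v_new, no_decay, [0.0] * 2, full_v)

print(read(overwritten, k))  # [4.0, 5.0, 6.0]: old value replaced
print(read(erase_only, k))   # [0.0, 0.0, 0.0]: old value removed, nothing written
print(read(write_only, k))   # [5.0, 7.0, 9.0]: new value added on top of the old one
```

In this sketch, setting every entry of b and w to the same scalar beta gives the same result as kda_step, which mirrors the reduction to KDA stated in the abstract; making the decay a single scalar as well would give a Gated DeltaNet-style update. The erase-only and write-only cases are the ones a single shared gate cannot express.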

Get this paper in your agent:

hf papers read 2605.22791

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
