Hugging Face Daily Papers · 4 min read

Delta Attention Residuals

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.18855

Published on May 13, 2026 · Submitted by taesiri on May 20, 2026
Authors: Cheng Luo, Zefan Cai, Junjie Hu
AI-generated summary

Delta Attention Residuals improve layer-wise routing by attending to feature changes rather than cumulative states, resulting in better attention distributions and model performance across different scales.

Abstract

Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over the cumulative hidden states of previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight ≈ 0.2), limiting the model's ability to select informative states from previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, the change introduced by each sublayer (v_i = h_{i+1} - h_i), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ≈ 0.6), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M to 7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, improving validation perplexity by 1.7 to 8.2%. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
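
The difference between routing over cumulative states and routing over deltas can be illustrated with a small toy sketch. This is not the authors' implementation (see the linked repository for that): the dot-product scoring rule, the single `query` vector, the helper names, and the hand-picked toy states are all assumptions made for illustration. The only piece taken from the abstract is the definition v_i = h_{i+1} - h_i and the idea of softmax attention over earlier layers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def deltas(states):
    """v_i = h_{i+1} - h_i for consecutive cumulative hidden states."""
    return [[b - a for a, b in zip(h, h_next)]
            for h, h_next in zip(states, states[1:])]


def route(values, query):
    """Softmax-attend over `values` with one query vector.

    The dot-product score is an assumed scoring rule, not the paper's.
    Returns (attention weights, weighted sum of values).
    """
    weights = softmax([dot(query, v) for v in values])
    mixed = [sum(w * v[d] for w, v in zip(weights, values))
             for d in range(len(query))]
    return weights, mixed


# Toy cumulative states h_0..h_3. Each sublayer writes along a new axis,
# on top of a large shared component, so later states repeat earlier content.
states = [
    [4.0, 0.0, 0.0, 0.0],
    [4.0, 1.0, 0.0, 0.0],
    [4.0, 1.0, 1.0, 0.0],
    [4.0, 1.0, 1.0, 1.0],
]
query = [0.0, 0.0, 2.0, 0.0]  # hypothetical learned query looking for axis 2

# Standard routing: candidates are the cumulative outputs h_1..h_3.
w_cumulative, _ = route(states[1:], query)

# Delta routing: candidates are the per-sublayer changes v_0..v_2.
w_delta, mixed_delta = route(deltas(states), query)

print("max weight over cumulative states:", round(max(w_cumulative), 3))
print("max weight over deltas:", round(max(w_delta), 3))
```

In this toy setup the feature on axis 2 is present in every cumulative state from h_2 onward, so the query cannot tell h_2 and h_3 apart and the largest weight stays near 0.47. Over deltas, only v_1 carries that feature, and the largest weight rises to about 0.79. These figures come from the hand-picked toy vectors and are unrelated to the 0.2 and 0.6 values reported in the abstract.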

Community

Paper submitter about 10 hours ago

Delta Attention Residuals improve transformer performance by attending to sublayer deltas instead of cumulative hidden states, enabling more selective and effective routing across layers.

Agerico commented:

"Yes. Attention Residuals management remains within Normal Science and the truth-seeking / performance-optimization intelligence paradigm. It is not yet operating inside the Revolutionary Science / Entropy Attractor Intelligence Paradigm (EAIP).

"But it is a good handshake surface for EAIP."

https://chatgpt.com/share/6a0d6d2f-1630-83aa-9085-9d650dd080d8


Get this paper in your agent:

hf papers read 2605.18855
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
