r/MachineLearning ยท ยท 1 min read

๐ƒ๐ž๐ฅ๐ญ๐š ๐€๐ญ๐ญ๐ž๐ง๐ญ๐ข๐จ๐ง ๐‘๐ž๐ฌ๐ข๐๐ฎ๐š๐ฅ๐ฌ [R]

Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.

๐ƒ๐ž๐ฅ๐ญ๐š ๐€๐ญ๐ญ๐ž๐ง๐ญ๐ข๐จ๐ง ๐‘๐ž๐ฌ๐ข๐๐ฎ๐š๐ฅ๐ฌ [R]

We're excited to release ๐ƒ๐ž๐ฅ๐ญ๐š ๐€๐ญ๐ญ๐ž๐ง๐ญ๐ข๐จ๐ง ๐‘๐ž๐ฌ๐ข๐๐ฎ๐š๐ฅ๐ฌ, a drop-in upgrade to residual connections that learns which past layers to route from โ€” without the routing collapse that breaks prior cross-layer attention at scale. ๐Ÿš€

Attention Residuals route over cumulative hidden states, but those are highly redundant, so routing collapses to near-uniform (max weight ~0.2) in deep layers. Delta Attention Residuals route over ๐๐ž๐ฅ๐ญ๐š๐ฌ (vแตข = hแตขโ‚Šโ‚ โˆ’ hแตข) โ€” what each sublayer actually contributed โ€” and natively enable:

โšก ๐Ÿ.๐Ÿ–ร— ๐ฌ๐ก๐š๐ซ๐ฉ๐ž๐ซ ๐œ๐ซ๐จ๐ฌ๐ฌ-๐ฅ๐š๐ฒ๐ž๐ซ ๐ซ๐จ๐ฎ๐ญ๐ข๐ง๐  Deltas are structurally diverse, lifting max attention weight from ~0.2 โ†’ ~0.6 (0.62 vs 0.35 avg) and curing routing collapse in deep layers.

๐Ÿ“‰ โˆ’๐Ÿ–.๐Ÿ% ๐ฏ๐š๐ฅ๐ข๐๐š๐ญ๐ข๐จ๐ง ๐๐๐‹ ๐š๐ญ ๐Ÿ•.๐Ÿ”๐ Consistent gains from 220M โ†’ 7.6B (1.7โ€“8.2% lower PPL), beating both standard residuals and Attention Residuals โ€” the latter actually degrades below baseline at scale (18.58 vs 17.43).

๐Ÿ”Œ ๐ƒ๐ซ๐จ๐ฉ-๐ข๐ง ๐Ÿ๐ข๐ง๐ž-๐ญ๐ฎ๐ง๐ข๐ง๐  ๐จ๐Ÿ ๐ฉ๐ซ๐ž๐ญ๐ซ๐š๐ข๐ง๐ž๐ ๐ฆ๐จ๐๐ž๐ฅ๐ฌ Additive, zero-init routing is identity at initialization, so you can convert pretrained checkpoints (e.g. Qwen3-0.6B) into Delta Attention Residuals via standard fine-tuning โ€” beating the original on 8 downstream benchmarks (55.6 vs 55.0).

๐Ÿชถ โ‰ค๐ŸŽ.๐ŸŽ๐Ÿ% ๐ฉ๐š๐ซ๐š๐ฆ๐ž๐ญ๐ž๐ซ ๐จ๐ฏ๐ž๐ซ๐ก๐ž๐š๐ Delta Block adds just 589K params (0.008% at 8B) and ~3% memory โ€” and runs faster + lighter than Attention Residuals (14.0k vs 12.5k tok/s, 42.7 vs 44.0 GB).

๐Ÿ’ป Code: https://github.com/wdlctc/delta-attention-residuals-code

๐Ÿ’ป Paper: https://arxiv.org/abs/2605.18855

https://preview.redd.it/bewovgw25b3h1.png?width=1359&format=png&auto=webp&s=6cee758f7a96f0adecd9a3fb8553dde3f1b92c74

submitted by /u/Mediocre-Ad5059
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds โ€” email code or GitHub.

Sign in โ†’

No comments yet. Sign in and be the first to say something.

More from r/MachineLearning