๐๐๐ฅ๐ญ๐ ๐๐ญ๐ญ๐๐ง๐ญ๐ข๐จ๐ง ๐๐๐ฌ๐ข๐๐ฎ๐๐ฅ๐ฌ [R]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
| We're excited to release ๐๐๐ฅ๐ญ๐ ๐๐ญ๐ญ๐๐ง๐ญ๐ข๐จ๐ง ๐๐๐ฌ๐ข๐๐ฎ๐๐ฅ๐ฌ, a drop-in upgrade to residual connections that learns which past layers to route from โ without the routing collapse that breaks prior cross-layer attention at scale. ๐ Attention Residuals route over cumulative hidden states, but those are highly redundant, so routing collapses to near-uniform (max weight ~0.2) in deep layers. Delta Attention Residuals route over ๐๐๐ฅ๐ญ๐๐ฌ (vแตข = hแตขโโ โ hแตข) โ what each sublayer actually contributed โ and natively enable: โก ๐.๐ร ๐ฌ๐ก๐๐ซ๐ฉ๐๐ซ ๐๐ซ๐จ๐ฌ๐ฌ-๐ฅ๐๐ฒ๐๐ซ ๐ซ๐จ๐ฎ๐ญ๐ข๐ง๐ Deltas are structurally diverse, lifting max attention weight from ~0.2 โ ~0.6 (0.62 vs 0.35 avg) and curing routing collapse in deep layers. ๐ โ๐.๐% ๐ฏ๐๐ฅ๐ข๐๐๐ญ๐ข๐จ๐ง ๐๐๐ ๐๐ญ ๐.๐๐ Consistent gains from 220M โ 7.6B (1.7โ8.2% lower PPL), beating both standard residuals and Attention Residuals โ the latter actually degrades below baseline at scale (18.58 vs 17.43). ๐ ๐๐ซ๐จ๐ฉ-๐ข๐ง ๐๐ข๐ง๐-๐ญ๐ฎ๐ง๐ข๐ง๐ ๐จ๐ ๐ฉ๐ซ๐๐ญ๐ซ๐๐ข๐ง๐๐ ๐ฆ๐จ๐๐๐ฅ๐ฌ Additive, zero-init routing is identity at initialization, so you can convert pretrained checkpoints (e.g. Qwen3-0.6B) into Delta Attention Residuals via standard fine-tuning โ beating the original on 8 downstream benchmarks (55.6 vs 55.0). ๐ชถ โค๐.๐๐% ๐ฉ๐๐ซ๐๐ฆ๐๐ญ๐๐ซ ๐จ๐ฏ๐๐ซ๐ก๐๐๐ Delta Block adds just 589K params (0.008% at 8B) and ~3% memory โ and runs faster + lighter than Attention Residuals (14.0k vs 12.5k tok/s, 42.7 vs 44.0 GB). ๐ป Code: https://github.com/wdlctc/delta-attention-residuals-code ๐ป Paper: https://arxiv.org/abs/2605.18855 [link] [comments] |
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds โ email code or GitHub.
Sign in โNo comments yet. Sign in and be the first to say something.