Rethinking the Role of Efficient Attention in Hybrid Architectures
Authors: Ziqing Qiao, Yinuo Xu, Chaojun Xiao, Zhou Su, Zihan Zhou, Yingfa Chen, Xiaoyue Xu, Xu Han, Zhiyuan Liu
Organization: OpenBMB
Published: 2026-06-13
arXiv: 2606.15378
Code: https://github.com/thunlp/rethinking-hybrid-attention
Abstract
Hybrid architectures combining full attention with efficient attention modules like sliding-window attention exhibit distinct scaling behaviors and optimization trajectories, with efficient attention primarily affecting the emergence speed of long-context capabilities rather than final performance.
Modern language models increasingly adopt hybrid architectures that combine full attention with efficient attention modules, such as sliding-window attention (SWA) and recurrent sequence mixers. However, how these efficient modules shape model capabilities remains poorly understood. To address this gap, we conduct a systematic analysis across hybrid architectures from three perspectives: scaling behavior, mechanism analysis, and architecture design. First, from a scaling perspective, we find that efficient-attention design primarily affects how fast long-context capability emerges, while different hybrids eventually converge to comparable long-context performance under sufficient training. Second, mechanistically, we show that long-range retrieval is mainly carried by full attention, whereas efficient attention shapes its optimization trajectory. This explains a counter-intuitive phenomenon we call Large-Window Laziness: larger SWA windows can delay the formation of retrieval heads in full-attention layers. Third, guided by this mechanism, we show that applying NoPE to only the full-attention layers of a small-window SWA hybrid substantially improves long-context performance with negligible impact on short-context performance.
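To make the terms in the abstract concrete, here is a minimal sketch of the two attention patterns and of a hybrid layer layout in which only the full-attention layers drop positional encoding (NoPE). It is an illustration, not the authors' implementation. The names `attention_mask`, `LayerSpec`, and `hybrid_layout` are hypothetical, and so are the interleaving ratio, the window size, and the choice of RoPE for the SWA layers; the abstract specifies none of these.

```python
from dataclasses import dataclass
from typing import List, Optional


def attention_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """Return mask[i][j], True when query position i may attend to key position j.

    window=None gives full causal attention. Otherwise each query sees only
    the most recent `window` positions, itself included (sliding window).
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            causal = j <= i
            in_window = window is None or i - j < window
            row.append(causal and in_window)
        mask.append(row)
    return mask


@dataclass
class LayerSpec:
    kind: str                # "full" or "swa"
    window: Optional[int]    # None for full attention
    positional: str          # "rope" or "nope"


def hybrid_layout(num_layers: int, full_every: int, window: int) -> List[LayerSpec]:
    """Hypothetical layout: every `full_every`-th layer is full attention
    with NoPE; all other layers are SWA and (by assumption) keep RoPE."""
    layers = []
    for idx in range(num_layers):
        if (idx + 1) % full_every == 0:
            layers.append(LayerSpec("full", None, "nope"))
        else:
            layers.append(LayerSpec("swa", window, "rope"))
    return layers


if __name__ == "__main__":
    for spec in hybrid_layout(num_layers=8, full_every=4, window=2):
        print(spec)
```

With a window of 2, the last query in a length-4 sequence sees only the last two positions, while under full attention it sees all four. Anything further back than the window can therefore only be retrieved through the full-attention layers, which is the setting the paper's mechanism analysis addresses.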
Community
Neat paper. The finding that efficient attention modules shape the optimization trajectory rather than just performance is a cool perspective. It’s wild that a larger sliding window can actually delay the formation of retrieval heads in full-attention layers.
Does this imply that the "Large-Window Laziness" phenomenon could be avoided by staggering the training of different layers, or is it an inherent trade-off for these hybrid architectures?
I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/4c48aa6d-94a7-4721-aea5-d84471ff70a9