Hugging Face Daily Papers · June 23, 2026 · 4 min read

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.20097


Published on Jun 18

Submitted by zhentao tan on Jun 23


Authors:

Zhentao Tan, Wei Chen, Jingyi Shen, Yao Liu, Xu Shen, Yue Wu, Jieping Ye

Abstract

HydraHead is a novel attention hybridization architecture that combines Full Attention and Linear Attention at the head level, achieving superior long-context performance with reduced training overhead through interpretability-driven selection and scale-normalized fusion.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

The quadratic complexity of attention poses a critical bottleneck for long-context processing, spurring interest in hybrid attention designs. Most open-source hybrid models adopt a layer-wise strategy. Yet, prior work has noted the inherent difficulty of integrating Linear Attention (LA) with Full Attention (FA), suggesting that the design space of attention hybridization remains underexplored. To probe this space, we conduct interpretability analysis and observe that layers exhibit block-wise functional similarity, while individual heads within the same layer display distinct functional specialization despite sharing input features. This head-level heterogeneity suggests that the head dimension provides a natural and principled granularity for fusing heterogeneous attention signals. Building on this insight, we introduce HydraHead, a novel architecture that hybridizes FA and LA along the head axis. HydraHead features two key innovations: (1) an interpretability-driven selection strategy that identifies retrieval-critical heads and preserves FA only for them, and (2) a scale-normalized fusion module that reconciles the distributional gap between FA and LA head outputs. By leveraging a three-stage transfer pipeline with parameter reuse and distillation, we achieve high-performance hybrid models with minimal training overhead. Under a unified training setup, HydraHead outperforms other hybrid designs in long-context tasks while maintaining strong general reasoning. With interpretability-driven head selection, it matches a 3:1 layer-wise hybrid's long-context performance at a 7:1 LA-to-FA ratio. Crucially, trained on only 15B tokens, HydraHead achieves over 69% improvement over the baseline at 512K context length, approaching Qwen3.5, a leading model of comparable size with a native context length of 256K. This highlights the significant scaling potential of head-level hybridization.
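To make the head-axis idea concrete, here is a minimal pure-Python sketch of hybridizing FA and LA per head. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the linear-attention variant, the fusion module, or the head-scoring method. The function names, the `elu(x) + 1` feature map, the per-head RMS normalization standing in for "scale-normalized fusion", and the `retrieval_scores` values are all hypothetical. The 7:1 LA-to-FA ratio is interpreted here as one FA head out of eight within a layer, which is also an assumption.

```python
import math
import random

def softmax_attention(q, k, v):
    """Causal full (softmax) attention for one head; cost grows as O(T^2)."""
    d = len(q[0])
    out = []
    for t in range(len(q)):
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        m = max(scores)                      # subtract max for numerical stability
        w = [math.exp(x - m) for x in scores]
        total = sum(w)
        out.append([sum(w[s] * v[s][j] for s in range(t + 1)) / total
                    for j in range(len(v[0]))])
    return out

def phi(x):
    """Positive feature map elu(x) + 1 (assumed; a common linear-attention choice)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention(q, k, v):
    """Causal linear attention for one head with a running state; cost grows as O(T)."""
    d, dv = len(q[0]), len(v[0])
    state = [[0.0] * dv for _ in range(d)]   # running sum of phi(k_s) v_s^T
    norm = [0.0] * d                         # running sum of phi(k_s)
    out = []
    for t in range(len(q)):
        fk, fq = phi(k[t]), phi(q[t])
        for i in range(d):
            norm[i] += fk[i]
            for j in range(dv):
                state[i][j] += fk[i] * v[t][j]
        denom = sum(fq[i] * norm[i] for i in range(d))
        out.append([sum(fq[i] * state[i][j] for i in range(d)) / denom
                    for j in range(dv)])
    return out

def rms_normalize(x, eps=1e-6):
    """Per-token RMS normalization; a stand-in for the paper's fusion module."""
    out = []
    for row in x:
        rms = math.sqrt(sum(a * a for a in row) / len(row) + eps)
        out.append([a / rms for a in row])
    return out

def select_fa_heads(retrieval_scores, num_fa):
    """Keep FA for the num_fa heads with the highest (hypothetical) retrieval score."""
    ranked = sorted(range(len(retrieval_scores)),
                    key=lambda h: retrieval_scores[h], reverse=True)
    return set(ranked[:num_fa])

def hybrid_attention(q, k, v, fa_heads):
    """q, k, v are [head][time][dim]; returns [time][num_heads * dim]."""
    per_head = []
    for h in range(len(q)):
        attn = softmax_attention if h in fa_heads else linear_attention
        # Normalize each head so FA and LA outputs share a common scale before concat.
        per_head.append(rms_normalize(attn(q[h], k[h], v[h])))
    return [[x for head in per_head for x in head[t]] for t in range(len(q[0]))]

# Usage: 8 heads at a 7:1 LA-to-FA ratio, so exactly one head keeps full attention.
random.seed(0)
H, T, D = 8, 5, 4

def rand_heads():
    return [[[random.gauss(0, 1) for _ in range(D)] for _ in range(T)] for _ in range(H)]

q, k, v = rand_heads(), rand_heads(), rand_heads()
retrieval_scores = [0.1, 0.05, 0.9, 0.2, 0.1, 0.3, 0.15, 0.05]  # made-up values
fa_heads = select_fa_heads(retrieval_scores, num_fa=H // 8)
y = hybrid_attention(q, k, v, fa_heads)
```

In this sketch only the selected head pays the quadratic cost and would need a growing key-value cache; the other seven heads carry a fixed-size state. Normalizing each head before concatenation is one simple way to keep either head type from dominating the fused output purely through scale.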


