Hugging Face Daily Papers · June 10, 2026 · 3 min read

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2606.11052

Published on Jun 9 · Submitted by Zhou on Jun 10

LARK Lab@HKUST (GZ)

Authors: Xinyu Zhou, Boyu Zhu, Yi Xu, Zhiwei Li, Yingfa Chen, Huiming Wang, Zhijiang Guo

Abstract

Chain-of-thought supervised fine-tuning degrades long-context recall in hybrid linear-attention models by biasing attention gradients toward short-range patterns, but a training-free method called QK-Restore can restore long-context capabilities by reverting query-key projections while preserving reasoning performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from 67.2% to 9.4%. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, which disrupts the query-key projections (W_Q, W_K) responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only W_Q and W_K from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from 65.4% to 76.4%.
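
The abstract describes QK-Restore as copying only W_Q and W_K back from the pre-SFT checkpoint and leaving every other post-SFT parameter alone. The following is a minimal sketch of that idea, not the authors' implementation. It assumes checkpoints are plain name-to-weight mappings and that query/key projection parameters can be recognized by the substrings `q_proj` and `k_proj`; that naming convention, the function name `qk_restore`, and the toy parameter names are all hypothetical. The Procrustes variant is not sketched because the abstract does not specify it.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Assumption: query/key projection parameters contain these substrings.
# Real checkpoints may use different names; adjust the markers accordingly.
QK_MARKERS: Tuple[str, ...] = ("q_proj", "k_proj")


def qk_restore(
    pre_sft: Dict[str, Any],
    post_sft: Dict[str, Any],
    qk_markers: Sequence[str] = QK_MARKERS,
) -> Tuple[Dict[str, Any], List[str]]:
    """Return a merged checkpoint plus the list of restored parameter names.

    Every parameter is taken from the post-SFT checkpoint, except those whose
    name matches a query/key marker, which are taken from the pre-SFT one.
    Neither input mapping is modified.
    """
    merged = dict(post_sft)  # shallow copy: start from the fine-tuned weights
    restored: List[str] = []
    for name in post_sft:
        if any(marker in name for marker in qk_markers):
            if name not in pre_sft:
                raise KeyError(f"{name!r} is missing from the pre-SFT checkpoint")
            merged[name] = pre_sft[name]  # revert this projection
            restored.append(name)
    return merged, restored


if __name__ == "__main__":
    # Toy checkpoints with lists standing in for weight tensors.
    pre = {"attn.q_proj.weight": [1.0], "attn.k_proj.weight": [2.0],
           "attn.v_proj.weight": [3.0], "mlp.up_proj.weight": [4.0]}
    post = {"attn.q_proj.weight": [10.0], "attn.k_proj.weight": [20.0],
            "attn.v_proj.weight": [30.0], "mlp.up_proj.weight": [40.0]}
    merged, restored = qk_restore(pre, post)
    print(restored)
    print(merged)
```

With PyTorch models, the same function could in principle be applied to the mappings returned by `state_dict()` on the two checkpoints, with the merged result passed to `load_state_dict()`. Which layers the paper actually restores in a hybrid architecture (softmax-attention layers only, or linear-attention layers as well) is not stated in the abstract, so the markers would need to be chosen accordingly.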