Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation
Authors: Zhongzhu Zhou, Qingyang Wu, Junxiong Wang, Mayank Mishra, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu
Published: 2026-06-15 (arXiv:2606.16429)
Code: https://github.com/FutureMLS-Lab/Taylor-Calibrate
Abstract
Hybrid linear attention models can be improved through a novel initialization technique that enhances conversion from pretrained Transformers by leveraging teacher attention statistics and alignment steps.
Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x to 9.2x fewer training tokens than naive conversion.
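To make the idea of setting a memory timescale from teacher attention statistics more concrete, here is a minimal sketch. It is not the paper's procedure: the abstract does not give the Taylor-guided statistics, so the sketch swaps in a simpler stand-in, the mean lookback distance of a causal attention map. It also assumes a single constant per-head decay, whereas a GDN layer uses learned, input-dependent gates. The function names `mean_lookback`, `decay_from_lookback`, and `causal_softmax` are hypothetical, and the attention map is random rather than taken from a real teacher.

```python
import numpy as np


def causal_softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax restricted to the lower triangle (causal mask)."""
    T = scores.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)  # exp(-inf) = 0 for masked (future) positions
    return weights / weights.sum(axis=-1, keepdims=True)


def mean_lookback(attn: np.ndarray) -> float:
    """Average query-to-key distance, in tokens, under a causal attention map.

    attn has shape (T, T); each row sums to 1 and is zero above the diagonal.
    """
    T = attn.shape[0]
    idx = np.arange(T)
    dist = np.clip(idx[:, None] - idx[None, :], 0, None)  # dist[i, j] = i - j
    return float((attn * dist).sum() / T)


def decay_from_lookback(d: float) -> float:
    """Pick a constant decay whose exponential memory has mean lag d.

    Normalized geometric weights alpha**k (k = 0, 1, ...) have mean lag
    alpha / (1 - alpha); solving alpha / (1 - alpha) = d gives d / (1 + d).
    This matching rule is an illustrative assumption, not the paper's.
    """
    return d / (1.0 + d)


# Stand-in for one teacher head: random scores, sequence length 64.
rng = np.random.default_rng(0)
attn = causal_softmax(rng.normal(size=(64, 64)))

d = mean_lookback(attn)
alpha = decay_from_lookback(d)
print(f"mean lookback = {d:.2f} tokens, matched decay = {alpha:.4f}")
```

Under these assumptions, a head that mostly attends to nearby tokens gets a small decay (short memory), while a head with long-range attention gets a decay close to 1. A naive conversion that leaves the decay at an arbitrary default would ignore this per-head difference, which is one plausible reading of the "poor dynamical regime" the abstract describes.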