Trust Region On-Policy Distillation
Authors: Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang
Organization: Samsung Research
Published: 2026-05-31 (arXiv: 2606.01249)
Abstract
Trust Region On-Policy Distillation (TrOPD) makes token-level supervision in large language model distillation more reliable by using trust regions, outlier estimation, and off-policy guidance to address instability under distribution mismatch.
On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially: teacher supervision on student-generated tokens may yield unreliable policy gradients and can even cause optimization to fail. This work addresses the problem of reliable on-policy token-level supervision through credit assignment strategies and proposes Trust Region On-Policy Distillation (TrOPD), which has three components:
1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch.
2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision.
3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions.
Experiments show that TrOPD consistently outperforms state-of-the-art (SoTA) OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
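The abstract does not spell out how the trust region is defined, so the following is only a rough sketch of how token-level trust-region handling could look, not the authors' implementation. It assumes the common K1 estimate of reverse KL, log π_s(x) − log π_t(x) for a token x sampled from the student, and it assumes (hypothetically) that a token counts as reliable when the absolute value of that estimate is at most a threshold `delta`. The function names, the `delta` parameter, and the `mode` switch are all illustrative.

```python
import math


def k1_reverse_kl(student_logp, teacher_logp):
    """Per-token K1 estimate of reverse KL for student-sampled tokens:
    log pi_s(x) - log pi_t(x)."""
    return [s - t for s, t in zip(student_logp, teacher_logp)]


def trust_region_advantages(k1, delta, mode="mask"):
    """Turn K1 estimates into per-token advantages (-k1).

    Assumption: a token is inside the trust region when |k1| <= delta.
    Outliers are either masked (advantage 0) or clipped to [-delta, delta].
    """
    advantages = []
    for k in k1:
        if abs(k) <= delta:
            advantages.append(-k)        # reliable token: plain OPD signal
        elif mode == "mask":
            advantages.append(0.0)       # outlier: drop its gradient
        elif mode == "clip":
            advantages.append(-max(-delta, min(delta, k)))  # outlier: bound it
        else:
            raise ValueError(f"unknown mode: {mode}")
    return advantages


def forward_kl(teacher_probs, student_probs):
    """Full-distribution forward KL, KL(pi_t || pi_s), at one position.
    Assumes student_probs is strictly positive wherever teacher_probs is."""
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


# Three student-sampled tokens; the teacher finds the third very unlikely.
student_logp = [math.log(0.5), math.log(0.4), math.log(0.9)]
teacher_logp = [math.log(0.4), math.log(0.5), math.log(0.001)]

k1 = k1_reverse_kl(student_logp, teacher_logp)
print(trust_region_advantages(k1, delta=1.0, mode="mask"))
print(trust_region_advantages(k1, delta=1.0, mode="clip"))
print(forward_kl([0.7, 0.2, 0.1], [0.3, 0.3, 0.4]))
```

In this toy setup the third token has a very large K1 value, so masking removes it from the update and clipping bounds its contribution, while the forward-KL function illustrates the kind of full-distribution term the abstract mentions as an alternative for outlier regions and for teacher-prefix guidance.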