KL for a KL: On-Policy Distillation with Control Variate Baseline
Abstract
vOPD stabilizes On-Policy Distillation by casting it as policy-gradient RL and subtracting a closed-form control variate baseline, the per-token negative reverse KL divergence, reducing gradient variance while maintaining efficiency and performance.
AI-generated summary
On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially in reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline (canonically, a value function) from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.
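The mechanism the abstract describes can be sketched numerically. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, tensor shapes, and default top-k size are hypothetical. It computes, per token, the single-sample reverse-KL reward and the closed-form top-k value-function baseline; in an autograd framework the resulting advantage would be detached and multiplied by the student's token log-probability to form the surrogate loss.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable log-softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def vopd_advantages(student_logits, teacher_logits, tokens, k=64):
    """Per-token advantages for a vOPD-style surrogate loss.

    reward_t : single-sample Monte Carlo estimate of the negative
               reverse KL at the sampled token x_t,
               -(log p_student(x_t) - log p_teacher(x_t))
    value_t  : closed-form baseline -KL(p_student || p_teacher),
               approximated here over the student's top-k tokens

    The returned advantage (reward - value) is treated as a constant
    (i.e. detached) coefficient on grad log p_student(x_t).
    """
    ls = log_softmax(student_logits)   # (T, V) student log-probs
    lt = log_softmax(teacher_logits)   # (T, V) teacher log-probs
    pos = np.arange(len(tokens))
    reward = -(ls[pos, tokens] - lt[pos, tokens])

    # top-k approximation of the per-token reverse KL under the student
    idx = np.argsort(-ls, axis=-1)[:, :k]
    topk_ls = np.take_along_axis(ls, idx, axis=-1)
    topk_lt = np.take_along_axis(lt, idx, axis=-1)
    value = -(np.exp(topk_ls) * (topk_ls - topk_lt)).sum(axis=-1)

    return reward - value
```

A quick sanity check of the variance-reduction structure: when student and teacher distributions coincide, both the reward and the baseline vanish, so the advantages are exactly zero and the update applies no gradient, as expected for a converged student.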
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes (2026)
- Hybrid Policy Distillation for LLMs (2026)
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation (2026)
- A Survey of On-Policy Distillation for Large Language Models (2026)
- SOD: Step-wise On-policy Distillation for Small Language Model Agents (2026)
- MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate (2026)
- OPSDL: On-Policy Self-Distillation for Long-Context Language Models (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Cite arxiv.org/abs/2605.07865 in a model, dataset, or Space README.md to link it from this page.