Hugging Face Daily Papers · 9 min read

KL for a KL: On-Policy Distillation with Control Variate Baseline

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.07865

KL for a KL: On-Policy Distillation with Control Variate Baseline

Published on May 8 · Submitted by Jongwon Lim on May 14

Authors: Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, Yohan Jo (Seoul National University)
Abstract

On-Policy Distillation with a control variate baseline stabilizes training through policy-gradient reinforcement learning techniques while maintaining efficiency and performance.

AI-generated summary

On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline, canonically a value function, from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.
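
The estimator described above is compact enough to sketch directly. Below is a minimal PyTorch sketch, assuming student and teacher logits for the on-policy rollout are already in hand; the function name, signature, and top_k parameter are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def vopd_loss(student_logits, teacher_logits, sampled_ids, top_k=None):
    # student_logits, teacher_logits: [B, T, V] logits at each rollout position.
    # sampled_ids: [B, T] tokens sampled on-policy from the student.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)

    # Single-sample "reward": log-prob ratio at the sampled token, i.e. the
    # negative one-sample Monte Carlo estimate of the per-token reverse KL.
    s_tok = s_logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    t_tok = t_logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    reward = (t_tok - s_tok).detach()

    with torch.no_grad():
        if top_k is None:
            # Closed-form value baseline: negative full reverse KL per token,
            # computed from the forward pass already in hand (no extra critic).
            baseline = -(s_logp.exp() * (s_logp - t_logp)).sum(-1)
        else:
            # Cheaper top-k approximation of the same baseline.
            vals, idx = s_logp.topk(top_k, dim=-1)
            baseline = -(vals.exp() * (vals - t_logp.gather(-1, idx))).sum(-1)

    advantage = reward - baseline
    # REINFORCE-style surrogate: the gradient flows only through the
    # student's log-probability of the sampled token.
    return -(advantage * s_tok).mean()

# Hypothetical usage: loss = vopd_loss(s_logits, t_logits, ids, top_k=64)

Because both the single-sample reward and the baseline are detached, subtracting an action-independent baseline leaves the policy-gradient estimate unbiased while shrinking its variance; the top-k branch perturbs the baseline's value but, since the baseline does not depend on the sampled token, still does not bias the gradient.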

Community

Paper submitter about 1 hour ago

nice work!

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes (2026): https://huggingface.co/papers/2603.25562
- Hybrid Policy Distillation for LLMs (2026): https://huggingface.co/papers/2604.20244
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation (2026): https://huggingface.co/papers/2604.13010
- A Survey of On-Policy Distillation for Large Language Models (2026): https://huggingface.co/papers/2604.00626
- SOD: Step-wise On-policy Distillation for Small Language Model Agents (2026): https://huggingface.co/papers/2605.07725
- MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate (2026): https://huggingface.co/papers/2605.01347
- OPSDL: On-Policy Self-Distillation for Long-Context Language Models (2026): https://huggingface.co/papers/2604.17535

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out the librarian-bots/recommend_similar_papers Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Get this paper in your agent:

hf papers read 2605.07865
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.07865 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.07865 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.07865 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.
