Hugging Face Daily Papers · 9 min read

KL for a KL: On-Policy Distillation with Control Variate Baseline

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.07865

KL for a KL: On-Policy Distillation with Control Variate Baseline

Published on May 8 · Submitted by Jongwon Lim on May 14

Authors: Minjae Oh, Sangjun Song, Gyubin Choi, Yunho Choi, Yohan Jo (Seoul National University)
Abstract

On-Policy Distillation with a control variate baseline stabilizes training through policy-gradient reinforcement learning techniques while maintaining efficiency and performance.

AI-generated summary

On-Policy Distillation (OPD) has emerged as a dominant post-training paradigm for large language models, especially for reasoning domains. However, OPD remains unstable in practice due to the high gradient variance of its single-sample Monte Carlo estimator, and recipes for stable training are still immature. We propose vOPD (On-Policy Distillation with a control variate baseline), which casts OPD as policy-gradient RL and stabilizes it by introducing a control variate baseline, canonically a value function, from the RL literature. We show that the OPD value function admits a closed form as the per-token negative reverse KL divergence between the student and the teacher, available directly from the already-computed forward pass with no additional critic or inference. Existing stabilization methods either compute the full token-level reverse KL over the entire vocabulary, adding significant overhead, or restrict it to a top-k support, biasing the objective. vOPD instead preserves the lightweight single-sample estimator, subtracting the value function as a detached baseline to keep the gradient unbiased while reducing variance. Furthermore, we show that a top-k approximation of the baseline further lowers cost without compromising performance. Across mathematical and scientific reasoning benchmarks, vOPD consistently outperforms vanilla OPD and matches the most expensive full-vocabulary baseline, offering an efficient stabilization of On-Policy Distillation through principled RL variance reduction.
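
The estimator described above is compact enough to sketch directly. Below is a minimal PyTorch sketch, assuming student and teacher logits for the on-policy rollout are already in hand; the function name, signature, and top_k parameter are illustrative assumptions, not the authors' released code.

import torch
import torch.nn.functional as F

def vopd_loss(student_logits, teacher_logits, sampled_ids, top_k=None):
    # student_logits, teacher_logits: [B, T, V] logits at each rollout position.
    # sampled_ids: [B, T] tokens sampled on-policy from the student.
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)

    # Single-sample "reward": log-prob ratio at the sampled token, i.e. the
    # negative one-sample Monte Carlo estimate of the per-token reverse KL.
    s_tok = s_logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    t_tok = t_logp.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    reward = (t_tok - s_tok).detach()

    with torch.no_grad():
        if top_k is None:
            # Closed-form value baseline: negative full reverse KL per token,
            # computed from the forward pass already in hand (no extra critic).
            baseline = -(s_logp.exp() * (s_logp - t_logp)).sum(-1)
        else:
            # Cheaper top-k approximation of the same baseline.
            vals, idx = s_logp.topk(top_k, dim=-1)
            baseline = -(vals.exp() * (vals - t_logp.gather(-1, idx))).sum(-1)

    advantage = reward - baseline
    # REINFORCE-style surrogate: the gradient flows only through the
    # student's log-probability of the sampled token.
    return -(advantage * s_tok).mean()

# Hypothetical usage: loss = vopd_loss(s_logits, t_logits, ids, top_k=64)

Because both the single-sample reward and the baseline are detached, subtracting an action-independent baseline leaves the policy-gradient estimate unbiased while shrinking its variance; the top-k branch perturbs the baseline's value but, since the baseline does not depend on the sampled token, still does not bias the gradient.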

Community

Paper submitter about 1 hour ago

nice work!

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes (2026): https://huggingface.co/papers/2603.25562
- Hybrid Policy Distillation for LLMs (2026): https://huggingface.co/papers/2604.20244
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation (2026): https://huggingface.co/papers/2604.13010
- A Survey of On-Policy Distillation for Large Language Models (2026): https://huggingface.co/papers/2604.00626
- SOD: Step-wise On-policy Distillation for Small Language Model Agents (2026): https://huggingface.co/papers/2605.07725
- MAD-OPD: Breaking the Ceiling in On-Policy Distillation via Multi-Agent Debate (2026): https://huggingface.co/papers/2605.01347
- OPSDL: On-Policy Self-Distillation for Long-Context Language Models (2026): https://huggingface.co/papers/2604.17535

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out the librarian-bots/recommend_similar_papers Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend


Get this paper in your agent:

hf papers read 2605.07865
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.07865 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.07865 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.07865 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.
