Hugging Face Daily Papers · May 21, 2026 · 7 min read

Learning from Language Feedback via Variational Policy Distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.15113

Published on May 18 · Submitted by Yang Li on May 21 · Salesforce AI Research

Authors: Yang Li, Erik Nijkamp, Semih Yavuz, Shafiq Joty

Abstract

AI-generated summary

Variational Policy Distillation enables reinforcement learning from language feedback by co-evolving teacher and student policies through variational expectation-maximization, overcoming limitations of passive distillation in complex reasoning tasks.

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.
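The sparse-signal problem named at the start of the abstract can be made concrete with a small sketch. The function names and the 50-test setup below are hypothetical, chosen only for illustration, and are not taken from the paper. The point is that a binary verifiable reward maps a near-miss and a complete failure to the same scalar, while textual feedback still tells them apart.

```python
from typing import List


def binary_reward(test_results: List[bool]) -> float:
    """Outcome reward in the RLVR style: 1.0 only if every test passes."""
    return 1.0 if all(test_results) else 0.0


def failure_feedback(test_results: List[bool]) -> str:
    """Hypothetical textual feedback listing which tests failed."""
    failed = [i for i, ok in enumerate(test_results) if not ok]
    if not failed:
        return "all tests passed"
    return f"{len(failed)} of {len(test_results)} tests failed: {failed}"


near_miss = [True] * 49 + [False]  # fails 1 test out of 50
all_wrong = [False] * 50           # fails every test

# Both rollouts receive the same scalar reward...
print(binary_reward(near_miss), binary_reward(all_wrong))
# ...but the feedback strings differ.
print(failure_feedback(near_miss))
print(failure_feedback(all_wrong)[:24])
```

A dense, token-level training signal has to come from somewhere other than the scalar; in the setting described here, it comes from a teacher that reads the feedback.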

Community

yli-ml

Paper submitter about 7 hours ago

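Variational Policy Distillation (VPD) addresses a key limitation of reinforcement learning from verifiable rewards (RLVR): the binary reward signal discards all information from near-miss failures. A coding solution that fails 1 test out of 50 gets the same reward as random noise, even though the compiler error tells you exactly what went wrong.

VPD formalizes learning from language feedback (compiler errors, LLM critiques, self-corrections) as a variational EM problem. Unlike prior self-distillation methods that treat the feedback-conditioned teacher as a frozen function, VPD co-trains the teacher and student in an alternating loop: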

E-step: refine the teacher's ability to interpret feedback via preference optimization
M-step: distill the improved teacher into the student on its own rollouts
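
One generic way to read the two steps is through the standard evidence lower bound used in EM-style policy optimization. The notation below is an assumption made for illustration (prompt $x$, response $y$, feedback $f$, success event $O$, student $\pi_\theta$, feedback-conditioned teacher $q_\phi$); it is not taken from the paper, whose exact objective may differ.

```latex
\log p_\theta(O = 1 \mid x)
  = \log \sum_{y} \pi_\theta(y \mid x)\, p(O = 1 \mid x, y)
  \;\ge\;
  \mathbb{E}_{y \sim q_\phi(\cdot \mid x, f)}\!\left[ \log p(O = 1 \mid x, y) \right]
  - \mathrm{KL}\!\left( q_\phi(\cdot \mid x, f) \,\|\, \pi_\theta(\cdot \mid x) \right)
```

In this textbook form, the E-step raises the bound with respect to $\phi$ while $\theta$ is held fixed, and the M-step raises it with respect to $\theta$ while $\phi$ is held fixed, which reduces to shrinking the KL term. The M-step described here runs on the student's own rollouts, so the divergence direction and sampling distribution actually used may not match this sketch.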

The teacher and student share a single network, so there's no additional memory overhead.
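
A toy sketch of what sharing one network could look like, under loud assumptions: the teacher and student are the same parameters, differing only in whether feedback is part of the input, and the `feedback_shift` vector is a hypothetical stand-in for conditioning on feedback text. The single-token vocabulary, parameter names, and learning rate are all illustrative and not from the paper. The update is a plain distillation step toward the teacher, with the teacher treated as a constant target.

```python
import math
from typing import Dict, List

Params = Dict[str, List[float]]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a small list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(params: Params, with_feedback: bool) -> List[float]:
    """One shared parameter set serves both roles.

    Student: next-token distribution without feedback in the context.
    Teacher: the same parameters plus a feedback-dependent logit shift
    (a toy stand-in for reading the feedback text).
    """
    logits = list(params["base"])
    if with_feedback:
        logits = [b + s for b, s in zip(logits, params["feedback_shift"])]
    return softmax(logits)


def distill_step(params: Params, lr: float = 0.5) -> None:
    """Move the student toward the teacher, holding the teacher fixed.

    For softmax logits z, the gradient of KL(teacher || softmax(z))
    with respect to z is (softmax(z) - teacher).
    """
    teacher = policy(params, with_feedback=True)   # constant target
    student = policy(params, with_feedback=False)
    params["base"] = [
        b - lr * (s - t) for b, s, t in zip(params["base"], student, teacher)
    ]


params: Params = {"base": [0.0, 0.0, 0.0], "feedback_shift": [2.0, 0.0, 0.0]}
before = policy(params, with_feedback=False)[0]
distill_step(params)
after = policy(params, with_feedback=False)[0]
print(f"student P(token 0): {before:.3f} -> {after:.3f}")
```

Because the parameters are shared, updating `base` also moves the teacher's distribution. That coupling is one reason a teacher and student living in the same network might need to be trained in alternation rather than treated as independent models.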

We evaluate on 3 models (Qwen3-4B, Qwen3-8B, Llama-3.1-8B) across code generation (LiveCodeBench) and scientific reasoning (SciKnowEval). VPD consistently improves over GRPO and self-distillation baselines, with notably more stable training dynamics. We also characterize where the approach has limitations: on strict mathematical reasoning, where error feedback is less informative, standard RL remains stronger.

Happy to discuss — feedback welcome!

librarian-bot

13 minutes ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

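The following papers were recommended by the Semantic Scholar API:

Self-Distilled RLVR (2026): https://huggingface.co/papers/2604.03128
SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting (2026): https://huggingface.co/papers/2604.10688
SOD: Step-wise On-policy Distillation for Small Language Model Agents (2026): https://huggingface.co/papers/2605.07725
Policy Improvement Reinforcement Learning (2026): https://huggingface.co/papers/2604.00860
Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing (2026): https://huggingface.co/papers/2605.05940
Learning with Rare Success but Rich Feedback via Reflection-Enhanced Self-Distillation (2026): https://huggingface.co/papers/2605.12741
VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation (2026): https://huggingface.co/papers/2603.26666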

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Get this paper in your agent:

hf papers read 2605.15113

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
