Hugging Face Daily Papers · June 8, 2026 · 5 min read

Reinforcement Learning from Rich Feedback with Distributional DAgger

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.05152


Published on Jun 3

Submitted by Rishabh Agrawal on Jun 8

University of Southern California


Authors: Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

Abstract

Forward cross-entropy objective with distributional imitation learning enables monotonic policy improvement and better performance in reasoning tasks compared to traditional reinforcement learning methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL-with-self-distillation objectives based on reverse KL or Jensen-Shannon divergence fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase the probability of worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL-with-self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
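To make the two objective families named in the abstract concrete, here is a minimal sketch of forward cross-entropy and reverse KL between an expert and a student distribution at a single state. It is not the paper's implementation: the function names, the three-action toy distributions, and the per-state (rather than sequence-level) setting are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_cross_entropy(expert, student):
    """H(expert, student) = -sum_a expert(a) * log student(a)."""
    return -sum(p * math.log(q) for p, q in zip(expert, student) if p > 0)

def reverse_kl(expert, student):
    """KL(student || expert) = sum_a student(a) * log(student(a) / expert(a))."""
    return sum(q * math.log(q / p) for p, q in zip(expert, student) if q > 0)

def forward_ce_logit_grad(expert, logits):
    """Gradient of H(expert, softmax(logits)) w.r.t. the logits.

    For an expert distribution that sums to 1 this is softmax(logits) - expert.
    """
    return [q - p for p, q in zip(expert, softmax(logits))]

# Hypothetical toy distributions over three actions (illustration only).
expert = [0.7, 0.2, 0.1]
logits = [0.0, 0.0, 0.0]      # student starts uniform
student = softmax(logits)

print("forward CE:", forward_cross_entropy(expert, student))
print("reverse KL:", reverse_kl(expert, student))
print("forward CE logit gradient:", forward_ce_logit_grad(expert, logits))
```

In this toy case a gradient-descent step on the forward cross-entropy raises the logit of the action the expert favors and lowers the others. The forward objective weights log-probabilities by the expert distribution, while reverse KL weights them by the student's own distribution; the abstract's claims about monotonic improvement concern that difference at the sequence level, which this single-state sketch does not reproduce.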


Community

rish-1086

Paper author Paper submitter about 5 hours ago

We found something surprising about existing self-distillation methods:

Even when the feedback-conditioned "teacher" is better than the student, imitating it can still make the student worse.

This is particularly striking because self-distillation has become one of the most promising ways to go beyond RLVR, where every token receives the same trajectory-level reward.

So we asked:

Can we learn from rich feedback in a way that actually guarantees monotonic policy improvement?

Introducing DistIL: Reinforcement Learning from Rich Feedback with Distributional DAgger.

Core idea: view rich-feedback RL as distributional imitation learning.

This gives:
• monotonic policy improvement guarantees
• regret bounds

And empirically, DistIL improves over RLVR, SDPO, and OPSD on:
• science reasoning
• coding
• mathematical reasoning

Paper: https://arxiv.org/pdf/2606.05152
