Revisiting DAgger in the Era of LLM-Agents
Abstract
Long-horizon LM agents learn from multi-turn interaction, where a single early mistake can alter the subsequent state distribution and derail the whole trajectory. Existing recipes fall short in complementary ways: supervised fine-tuning provides dense teacher supervision but suffers from covariate shift because it trains on off-policy teacher trajectories, while reinforcement learning with verifiable rewards avoids this off-policy mismatch by learning from on-policy rollouts but offers only sparse outcome feedback. We address this dilemma by revisiting Dataset Aggregation (DAgger) for multi-turn LM agents: the algorithm collects trajectories through a turn-level interpolation of student and teacher policies, and the student is then trained on these trajectories using supervised labels provided by the teacher. By interacting directly with environments, the model is exposed to realistic states it is likely to encounter during deployment, effectively mitigating covariate shift. Moreover, since the student learns by mimicking the teacher's behavior, it receives rich feedback throughout training. To demonstrate that DAgger enjoys the best of both worlds, we apply the algorithm to train software-engineering agents with 4B- and 8B-scale student models. On SWE-bench Verified, our DAgger-style training improves over the strongest post-training baseline by +3.9 points at 4B and +3.6 points at 8B. The resulting 4B agent reaches 27.3%, outperforming representative published 8B SWE-agent systems, while the 8B agent achieves 29.8%, surpassing SWE-Gym-32B and coming within 5 points of stronger 32B-scale agents. Together with consistent gains on the held-out SWE-Gym split, these results suggest the effectiveness of DAgger for modern long-horizon LM agents.
AI-generated summary
DAgger-style training for long-horizon language model agents combines the benefits of supervised fine-tuning and reinforcement learning by interpolating teacher and student policies during on-policy interaction.
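To make the recipe concrete, here is a minimal Python sketch of the turn-level DAgger loop the abstract describes. The `student`, `teacher`, and `env` objects and their `act`/`fit`/`reset`/`step`/`done` methods are hypothetical placeholders rather than the paper's actual interfaces, and the annealed mixing coefficient `beta` follows the classic DAgger schedule; treat this as an illustration under those assumptions, not the authors' implementation.

```python
import random

def collect_trajectory(student, teacher, env, beta):
    """Roll out one episode under a turn-level student/teacher mixture policy.

    Every visited state is labeled with the teacher's action (dense
    supervision), but with probability `beta` the teacher's action is also
    the one executed; as beta decays, the visited states approach the
    student's own on-policy distribution.
    """
    dataset, state = [], env.reset()
    while not env.done():
        teacher_action = teacher.act(state)      # supervised label
        student_action = student.act(state)
        dataset.append((state, teacher_action))  # always label with teacher
        action = teacher_action if random.random() < beta else student_action
        state = env.step(action)
    return dataset

def dagger(student, teacher, env, iterations=10, beta0=1.0, decay=0.7):
    """Classic DAgger: aggregate labeled data across iterations, retrain."""
    aggregated = []
    for i in range(iterations):
        beta = beta0 * (decay ** i)              # anneal toward on-policy rollouts
        aggregated += collect_trajectory(student, teacher, env, beta)
        student.fit(aggregated)                  # supervised fine-tuning step
    return student
```

The property the sketch preserves is the one the abstract argues for: states are visited under a policy that gradually approaches the student's own, yet every visited state still receives a dense teacher label, combining on-policy state coverage with rich supervision.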
Community
DAgger is back for the agent era: we revisit it for LLM agents and place SFT, RL, on-policy distillation (OPD), and DAgger under one unified post-training lens. Using our training recipe, 4B SWE agents beat published 8B systems, while 8B agents approach 32B-scale performance.
Nice paper! May I ask if there is any plan to release the code?