Policy and World Modeling Co-Training for Language Agents
Authors: Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv, Yanbin Wei, Lingting Zhu, Shengju Qian, Xin Wang, Ying-Cong Chen, Qi Wang, Ke Tang
Abstract
PaW is a co-training framework that combines policy learning and world modeling, reusing on-policy reinforcement learning rollouts as world-modeling supervision to improve language-agent training without adding inference-time computation.
Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
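The abstract names PaW's three components but gives no formulas, so what follows is only a minimal sketch of how such a co-training objective could be assembled, not the authors' implementation. Everything concrete here is an assumption made for illustration: the `Transition` fields, keeping the highest-entropy transitions (the abstract does not say which direction the selection goes), a clipped token-level negative log-likelihood as a stand-in for the noise-tolerant WM loss, and a WM weight that shrinks linearly as the mean rollout reward (assumed to lie in [0, 1]) rises.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Transition:
    """One step of an on-policy rollout (field names are hypothetical)."""
    context: str            # history the agent saw before acting
    action: str             # action text emitted by the policy
    next_observation: str   # what the environment returned after the action
    action_token_probs: Sequence[Sequence[float]]  # per-token distributions


def mean_action_entropy(t: Transition) -> float:
    """Average Shannon entropy (in nats) over the action's token distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in t.action_token_probs
    ]
    return sum(entropies) / len(entropies)


def select_wm_examples(
    rollout: List[Transition], top_fraction: float = 0.5
) -> List[Tuple[str, str]]:
    """Turn transitions into (input, target) pairs for next-observation prediction.

    Assumption: the most uncertain actions are kept; the paper may differ.
    """
    ranked = sorted(rollout, key=mean_action_entropy, reverse=True)
    k = max(1, math.ceil(top_fraction * len(ranked)))
    return [(t.context + t.action, t.next_observation) for t in ranked[:k]]


def clipped_nll(target_token_probs: Sequence[float], cap: float = 4.0) -> float:
    """Mean token NLL with each token's loss capped at `cap`.

    A stand-in for a noise-tolerant loss: unpredictable observation tokens
    cannot contribute more than `cap` each.
    """
    losses = [min(-math.log(max(p, 1e-12)), cap) for p in target_token_probs]
    return sum(losses) / len(losses)


def wm_weight(mean_reward: float, base: float = 0.1) -> float:
    """Assumed schedule: less WM weight as rewards (clamped to [0, 1]) improve."""
    r = min(max(mean_reward, 0.0), 1.0)
    return base * (1.0 - r)


def co_training_loss(
    rl_loss: float, wm_loss: float, mean_reward: float, base: float = 0.1
) -> float:
    """RL objective plus the reward-weighted auxiliary WM term."""
    return rl_loss + wm_weight(mean_reward, base) * wm_loss
```

Because the WM term is added to the same policy's training loss, nothing in this sketch runs at inference time, which matches the abstract's claim that the inference paradigm is unchanged. In a real training loop the scalar losses would be differentiable tensors produced by the policy model rather than Python floats.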
Community
Standard RL rewards LLM agents for good actions, but ignores what those actions change. Our proposed PaW turns every RL rollout into world-modeling supervision, helping agents predict next observations and act more reliably in long-horizon tasks, without adding any deployment cost.
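This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
* GAGPO: Generalized Advantage Grouped Policy Optimization (2026), https://huggingface.co/papers/2605.13217
* StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction (2026), https://huggingface.co/papers/2605.06642
* Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy (2026), https://huggingface.co/papers/2605.14558
* Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning (2026), https://huggingface.co/papers/2605.26684
* GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering (2026), https://huggingface.co/papers/2605.29584
* AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning (2026), https://huggingface.co/papers/2605.00425
* ECHO: Terminal Agents Learn World Models for Free (2026), https://huggingface.co/papers/2605.24517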
Cite arxiv.org/abs/2606.02388 in a model, dataset, or Space README.md to link it from this page.