Hugging Face Daily Papers · June 2, 2026 · 6 min read

Policy and World Modeling Co-Training for Language Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Standard RL rewards LLM agents for good actions but ignores what those actions change. Our proposed PaW turns every RL rollout into world-modeling supervision, helping agents predict next observations and act more reliably in long-horizon tasks, without adding any deployment cost.

arxiv:2606.02388

Published on Jun 1

Submitted by Ning Lu on Jun 2

HKUST

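Authors: Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv, Yanbin Wei, Lingting Zhu, Shengju Qian, Xin Wang, Ying-Cong Chen, Qi Wang, Ke Tang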

Abstract

PaW is a co-training framework that combines policy learning and world modeling using on-policy reinforcement learning rollouts to improve language agent training without additional computational overhead.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
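The central observation above, that each on-policy transition already pairs an action with its resulting next observation, can be illustrated with a small Python sketch. This is not the paper's implementation: the `Transition` record, the prompt template, the `select_wm_examples` helper, the choice to keep the highest-entropy steps, and the fixed `wm_weight` are all assumptions made for illustration. The abstract names action-entropy-based selection, a noise-tolerant WM loss, and reward-adaptive loss balancing, but does not define them, so the sketch substitutes a plain top-fraction filter and a constant-weight sum.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Transition:
    """One step of an on-policy rollout (hypothetical record layout)."""
    observation: str
    action: str
    next_observation: str
    # Assumed: the policy's probability distribution over candidate actions.
    action_probs: Sequence[float]


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def to_wm_example(t: Transition) -> Tuple[str, str]:
    """Turn a transition into a (prompt, target) pair for next-observation prediction."""
    prompt = f"Observation: {t.observation}\nAction: {t.action}\nNext observation:"
    return prompt, t.next_observation


def select_wm_examples(
    rollout: Sequence[Transition], keep_fraction: float = 0.5
) -> List[Tuple[str, str]]:
    """Keep the top `keep_fraction` of steps ranked by action entropy.

    Assumption: ranking high-entropy steps first is an illustrative choice;
    the abstract does not state which selection rule PaW uses.
    """
    ranked = sorted(
        rollout, key=lambda t: shannon_entropy(t.action_probs), reverse=True
    )
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return [to_wm_example(t) for t in ranked[:n_keep]]


def combined_loss(policy_loss: float, wm_loss: float, wm_weight: float = 0.1) -> float:
    """Constant-weight sum, standing in for reward-adaptive loss balancing."""
    return policy_loss + wm_weight * wm_loss


rollout = [
    Transition("door closed", "open door", "door open", [0.5, 0.5]),
    Transition("door open", "walk through", "in hallway", [1.0, 0.0]),
]
wm_examples = select_wm_examples(rollout, keep_fraction=0.5)
```

In this sketch the world-modeling pairs come from the same rollout that feeds the policy update, so no separate simulator or extra data collection is involved, and nothing changes at inference time because the auxiliary term only affects training.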

Community

ColinLu50

Paper author Paper submitter about 10 hours ago

librarian-bot

1 minute ago

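This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* GAGPO: Generalized Advantage Grouped Policy Optimization (2026): https://huggingface.co/papers/2605.13217
* StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction (2026): https://huggingface.co/papers/2605.06642
* Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy (2026): https://huggingface.co/papers/2605.14558
* Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning (2026): https://huggingface.co/papers/2605.26684
* GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering (2026): https://huggingface.co/papers/2605.29584
* AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning (2026): https://huggingface.co/papers/2605.00425
* ECHO: Terminal Agents Learn World Models for Free (2026): https://huggingface.co/papers/2605.24517

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend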

Get this paper in your agent:

hf papers read 2606.02388

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
