Authors: Vaishnavi Shrivastava, Piero Kauffmann, Ahmed Awadallah, Dimitris Papailiopoulos
Organization: Microsoft Research
Published: 2026-05-23
Links posted by vshrivas (Vaishnavi Shrivastava): https://echorl.notion.site/ECHO-Terminal-Agents-Learn-World-Models-for-Free-360d62040bac8074b6c1f74cc029a666 and https://x.com/DimitrisPapail/status/2056368948870811746
Comment from avahal (2026-05-26): Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/echo-terminal-agents-learn-world-models-for-free-8462-d21ada46 Covers the executive summary, detailed methodology, and practical applications.
ECHO: Terminal Agents Learn World Models for Free
Abstract
AI-generated summary: ECHO (Environment Cross-entropy Hybrid Objective) combines a policy-gradient loss on action tokens with an auxiliary loss for predicting environment observation tokens, which turns terminal feedback into dense supervision and improves agent performance and self-improvement.
CLI agents are the closest thing language models have to an embodied setting: the model emits commands, the terminal executes them, and the returned stream -- stdout, errors, files, logs, and traces -- records the consequences. We argue that this stream is a supervision signal, but standard agent RL discards it: GRPO-style training updates action tokens with sparse outcome-level rewards while ignoring environment responses already in the rollout. Failed rollouts provide little policy-gradient signal despite containing rich evidence about how the environment responds. We introduce ECHO (Environment Cross-entropy Hybrid Objective), a hybrid objective that combines the standard policy-gradient loss on action tokens with an auxiliary loss that trains the policy to predict environment observation tokens resulting from its own actions. ECHO reuses the same forward pass as GRPO, requires no additional rollouts, and turns terminal feedback into dense supervision for all rollouts. ECHO doubles GRPO pass@1 on TerminalBench-2.0: Qwen3-8B improves from 2.70% to 5.17%, and Qwen3-14B from 5.17% to 10.79%. ECHO also produces policies that better predict terminal dynamics, even on trajectories they did not generate: across held-out rollouts, it sharply reduces environment-token cross-entropy while GRPO alone barely changes it. From base Qwen3-8B, ECHO matches expert-SFT-then-GRPO performance on held-out terminal tasks without expert demonstrations, and recovers roughly half of the expert-SFT initialization benefit on TerminalBench-2.0. In some settings, the environment prediction loss alone enables verifier-free self-improvement, allowing policies to improve on unseen OOD tasks by learning only from environment interactions. Together, these results suggest that environment observations are not merely context for future actions, but a dense, on-policy supervision signal already present in every rollout.
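The hybrid objective described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a plain advantage-weighted log-likelihood term in place of the full clipped GRPO surrogate, and the names `Rollout`, `group_advantages`, `echo_loss`, and the weight `env_weight` are hypothetical (the abstract does not say how the two terms are weighted). Per-token log-probabilities are assumed to come from the single forward pass over each rollout.

```python
import math
from typing import List, NamedTuple, Tuple


class Rollout(NamedTuple):
    # Hypothetical container for one rollout in a sampled group.
    logprobs: List[float]   # log p(token_t | prefix) under the current policy
    is_action: List[bool]   # True: token emitted by the policy; False: terminal output
    reward: float           # sparse outcome-level reward for the whole rollout


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages in the style of GRPO: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def echo_loss(group: List[Rollout], env_weight: float = 0.1) -> Tuple[float, float, float]:
    """Return (total, policy_gradient_term, env_cross_entropy_term).

    Simplification (assumption): the policy term is -advantage * logprob on
    action tokens, without the importance-ratio clipping of full GRPO.
    """
    advantages = group_advantages([r.reward for r in group])
    pg_sum, n_action = 0.0, 0
    ce_sum, n_env = 0.0, 0
    for rollout, adv in zip(group, advantages):
        for logprob, is_action in zip(rollout.logprobs, rollout.is_action):
            if is_action:
                # Sparse signal: scaled by the rollout-level advantage.
                pg_sum += -adv * logprob
                n_action += 1
            else:
                # Dense signal: next-token cross-entropy on environment output.
                ce_sum += -logprob
                n_env += 1
    pg = pg_sum / max(n_action, 1)
    ce = ce_sum / max(n_env, 1)
    return pg + env_weight * ce, pg, ce


# A group in which every rollout failed: all advantages are zero, so the
# policy-gradient term vanishes while the environment term is still nonzero.
failed_group = [
    Rollout(logprobs=[-0.5, -2.0, -0.1], is_action=[True, False, True], reward=0.0),
    Rollout(logprobs=[-1.0, -3.0], is_action=[True, False], reward=0.0),
]
total, pg, ce = echo_loss(failed_group)
```

The all-failed group illustrates the abstract's point about failed rollouts: the group-normalized advantages are all zero, so only the environment cross-entropy term contributes to the loss.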
arXiv: arxiv.org/abs/2605.24517