Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
Authors: Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia
Organization: Microsoft Research
Published: 2026-06-04
Project page: https://paper-rho.wenbo.io
Code: https://github.com/wbopan/retro-harness
Abstract
Retrospective Harness Optimization (RHO) is a self-supervised method that improves AI agent performance by optimizing the agent's harness using only past trajectories, through diverse task selection, parallel re-solving, and self-validation.
AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.
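The diagnosis and selection steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `self_consistency`, `select_by_pairwise_preference`, and the `prefer` callback are hypothetical names, and `prefer` stands in for whatever judgment the agent itself would make when comparing two candidate harness updates. The sketch assumes self-consistency is measured as majority agreement across parallel rollouts and that the winning candidate is the one with the most head-to-head wins; the paper may aggregate differently.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of rollouts that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def select_by_pairwise_preference(
    candidates: Sequence[str],
    prefer: Callable[[str, str], str],
) -> str:
    """Pick the candidate that wins the most head-to-head comparisons.

    `prefer(a, b)` must return either `a` or `b`. In a real system this
    would be a call to the agent acting as its own judge (assumption).
    """
    wins = Counter({c: 0 for c in candidates})
    for a, b in combinations(candidates, 2):
        wins[prefer(a, b)] += 1
    # max() keeps the first maximal item, so ties go to the earliest candidate.
    return max(candidates, key=lambda c: wins[c])


# Toy data: four rollouts of one task, and three candidate harness updates.
rollouts = ["patch_a", "patch_a", "patch_b", "patch_a"]
updates = [
    "add lint step",
    "add lint step and rerun failing tests",
    "no change",
]


def toy_prefer(a: str, b: str) -> str:
    # Placeholder judge that prefers the longer description (illustration only).
    return a if len(a) >= len(b) else b


majority, agreement = self_consistency(rollouts)
chosen = select_by_pairwise_preference(updates, toy_prefer)
```

A low agreement score flags a task where the rollouts disagree, which is one plausible signal for diagnosing a failure without ground-truth labels. The round-robin comparison costs one judgment per pair of candidates, so it suits a small candidate pool.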
Community
RHO (Retrospective Harness Optimization) improves an LLM agent's harness (its skills, tools, and workflows) using only the agent's own past trajectories, with no ground-truth validation set. It selects a difficulty-diverse coreset of past tasks with a determinantal point process (DPP), re-solves each task in parallel, diagnoses failures via self-validation and self-consistency, and picks among candidate harness updates by pairwise self-preference. A single optimization round improves the SWE-Bench Pro pass rate from 59% to 78% without any external grading, with consistent gains on Terminal-Bench 2 and GAIA-2.
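As a rough illustration of DPP-based coreset selection, the sketch below runs greedy MAP inference over a quality-times-similarity kernel, a standard DPP construction. Everything specific here is an assumption rather than the paper's method: the kernel form, the use of cosine similarity, the per-task `difficulty` scores, and the function names are all hypothetical.

```python
import math
from typing import Sequence


def determinant(matrix: list[list[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def greedy_dpp(
    features: Sequence[Sequence[float]],
    difficulty: Sequence[float],
    k: int,
) -> list[int]:
    """Greedily pick up to k task indices that maximize det of the kernel submatrix.

    Kernel (assumed form): L[i][j] = difficulty[i] * cosine(f_i, f_j) * difficulty[j].
    Hard tasks raise the determinant; near-duplicate tasks lower it.
    """
    n = len(features)
    kernel = [
        [difficulty[i] * cosine(features[i], features[j]) * difficulty[j] for j in range(n)]
        for i in range(n)
    ]
    selected: list[int] = []
    while len(selected) < min(k, n):
        best, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            d = determinant([[kernel[r][c] for c in idx] for r in idx])
            if d > best_det:
                best, best_det = i, d
        if best is None:  # every remaining task is redundant with the selection
            break
        selected.append(best)
    return selected


# Toy data: tasks 0 and 1 are near-duplicates; task 2 is different but easier.
task_features = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0]]
task_difficulty = [0.9, 0.8, 0.5]
coreset = greedy_dpp(task_features, task_difficulty, k=2)
```

On the toy data the selection takes the hardest task first and then skips its near-duplicate in favour of the dissimilar task, which is the trade-off between difficulty and diversity that a DPP encodes. Recomputing determinants from scratch is fine for a handful of tasks; a larger pool would call for an incremental update.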
Paper: arxiv.org/abs/2606.05922