Continual Harness: Online Adaptation for Self-Improving Foundation Agents
Abstract
Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for the long-horizon, partially observable decision-making of embodied agents. We first report our Gemini Plays Pokemon (GPP) experiments. Through iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside the human-in-the-loop refinement. Continual Harness fully removes the human from this loop: it is a reset-free, self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Whereas prompt-optimization methods require episode resets, Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to a minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
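To make the act/refine alternation concrete, here is a minimal sketch of a reset-free continual-harness loop. It is an illustration under stated assumptions, not the paper's implementation: the `Harness` fields, the `model.choose_button` / `model.propose_edits` calls, the `env.press` interface, and the `refine_every` cadence are all hypothetical stand-ins for whatever the actual system uses.

```python
# Minimal sketch of a reset-free act/refine loop in the spirit of
# Continual Harness. All names (Harness fields, model and env APIs,
# refinement cadence) are illustrative assumptions, not the paper's
# actual interface.
from dataclasses import dataclass, field

@dataclass
class Harness:
    prompt: str = "You control the game. Choose one button per step."
    skills: dict = field(default_factory=dict)      # name -> button macro
    memory: list = field(default_factory=list)      # long-context notes
    sub_agents: dict = field(default_factory=dict)  # name -> system prompt

def act(model, harness, observation):
    """Query the model through the current harness for the next button press."""
    context = harness.prompt + "\n" + "\n".join(harness.memory[-50:])
    return model.choose_button(context, observation, harness.skills)

def refine(model, harness, trajectory):
    """Let the model edit its own harness from past trajectory data.
    No environment reset: the game keeps its current state throughout."""
    edits = model.propose_edits(harness, trajectory)  # hypothetical API
    harness.prompt = edits.get("prompt", harness.prompt)
    harness.skills.update(edits.get("skills", {}))
    harness.memory.extend(edits.get("memory", []))
    return harness

def continual_run(model, env, steps, refine_every=200):
    """Alternate acting and harness refinement within a single run."""
    harness, trajectory = Harness(), []
    obs = env.observe()  # minimal interface: raw screen / game state
    for t in range(steps):
        button = act(model, harness, obs)
        obs, info = env.press(button)            # environment is never reset
        trajectory.append((obs, button, info))
        if (t + 1) % refine_every == 0:          # periodic refinement phase
            harness = refine(model, harness, trajectory)
    return harness, trajectory
```

The property mirrored here is the one the abstract emphasizes: `refine` edits the harness in place while the environment state persists, so adaptation happens online within a single run rather than across reset episodes.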