Adaptive Auto-Harness: Sustained Self-Improvement for Agentic System Deployment on Open-Ended Task Streams
Authors: Zewen Liu, Zhan Shi, Yisi Sang, Bing He, Minhua Lin, Tianxin Wei, Dakuo Wang, Benoit Dumoulin, Wei Jin, Hanqing Lu
Published: 2026-06-01 (arXiv:2606.01770)
Abstract
The Adaptive Auto-Harness framework addresses dynamic task streams by decomposing the performance gap into evolution and adaptation losses, and it uses a stateful multi-agent evolver and a harness tree with solve-time routing to sustain performance improvement.
Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offline benchmarks. Real deployments instead present open-ended task streams: histories grow without a fixed endpoint, heterogeneous tasks require different harnesses, and problem distributions shift over time. These challenges make a single repeatedly and densely updated harness brittle, causing performance degradation as accuracy peaks early and then declines. This motivates sustained harness construction with task-wise adaptation. We introduce Adaptive Auto-Harness, a framework and system for such streams. The framework decomposes the gap to an oracle harness into evolution loss and adaptation loss. The system addresses these losses with a stateful multi-agent evolver, a harness tree with solve-time routing, and human-steering hooks for cases where history lacks the needed signal. Across prediction-market, security-competition, and event-forecasting streams, Adaptive Auto-Harness outperforms five existing auto-harness baselines, and ablations attribute gains to better construction, routing, or targeted human steering. Code is available at https://github.com/A-EVO-Lab/AdaptiveHarness.
Community
This paper studies auto-harness LLM agents, which improve an agent system by editing its
harness (prompts, skills, tools, memories) instead of model weights, under open-ended task
streams, where a single densely updated harness goes brittle (accuracy peaks early, then
declines). It frames the gap to an oracle harness as two losses: evolution loss (the
evolver's limited ability to build good harnesses from history) and adaptation loss
(committing to one harness before seeing the task).
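One way to read the two-loss framing is as a telescoping split of the gap to the oracle. The notation here is an assumption made for illustration and is not taken from the paper: $U(h, x)$ is the score of harness $h$ on task $x$, $h^{*}_{x}$ is the oracle harness for $x$, $\mathcal{H}_t$ is the set of harnesses the evolver has built by step $t$, and $h_t$ is the harness actually used.

```latex
% Illustrative notation only (assumed, not the paper's definitions).
\begin{align*}
\underbrace{U(h^{*}_{x}, x) - U(h_t, x)}_{\text{gap to oracle}}
  &= \underbrace{U(h^{*}_{x}, x) - \max_{h \in \mathcal{H}_t} U(h, x)}_{\text{evolution loss}}
   + \underbrace{\max_{h \in \mathcal{H}_t} U(h, x) - U(h_t, x)}_{\text{adaptation loss}}
\end{align*}
```

Under this reading, the first term shrinks when the evolver builds better harnesses from history, and the second shrinks when the system picks, per task, the best harness it already has rather than committing to one in advance.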
The system reduces evolution loss with a stateful multi-agent evolver
(Analyst→Researchers→Builder→Verifier, cross-cycle memory, temporal-reveal
feedback) and adaptation loss with a harness tree plus solve-time routing.
Across prediction-market, CTF, and event-forecasting streams, it beats five
auto-harness baselines (e.g. PolyBench 80.9% accuracy, +330 coverage-scaled
return). Code: https://github.com/A-EVO-Lab/a-evolve/tree/release/adaptive-auto-harness
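As a rough illustration of what a harness tree with solve-time routing could look like, here is a minimal Python sketch. Everything in it is hypothetical: the `HarnessNode` class, the keyword-overlap `score`, and the example tree are assumptions chosen to keep the example small, and they are not the released code's data structures or routing rule.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessNode:
    """One harness variant; children specialize the parent (hypothetical layout)."""
    name: str
    keywords: frozenset = frozenset()  # toy routing signal, an assumption
    children: list = field(default_factory=list)


def score(node: HarnessNode, task: str) -> int:
    """Count how many of the node's keywords appear in the task text."""
    words = set(task.lower().split())
    return len(node.keywords & words)


def route(root: HarnessNode, task: str) -> HarnessNode:
    """Descend from the root, following the best-matching child at each level.

    Stops at the current node when no child matches, so an unfamiliar task
    falls back to a more general harness instead of a wrong specialist.
    """
    node = root
    while node.children:
        best = max(node.children, key=lambda child: score(child, task))
        if score(best, task) == 0:
            break
        node = best
    return node


# Hypothetical tree: a general root with two specialized branches.
root = HarnessNode("general", children=[
    HarnessNode("forecasting", frozenset({"market", "probability", "election"}), [
        HarnessNode("sports-markets", frozenset({"match", "league"})),
    ]),
    HarnessNode("ctf", frozenset({"flag", "exploit", "binary"})),
])

if __name__ == "__main__":
    for task in ("Find the flag in this binary", "Summarize this memo"):
        print(task, "->", route(root, task).name)
```

The point of the sketch is the timing of the decision: the harness is chosen after the task is seen, which is the part of the design aimed at adaptation loss. A real router would presumably use a learned or LLM-based matcher rather than keyword overlap.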