Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
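Authors: Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou
Organization: NLPIR Lab @ RUC (RUC-NLPIR)
Published: 2026-06-10
Project page: https://ruc-nlpir.github.io/Arbor/
Code: https://github.com/RUC-NLPIR/Arbor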
Abstract
An AI framework called Arbor enables autonomous scientific research by combining strategic coordination, isolated hypothesis testing, and a persistent knowledge tree to iteratively improve research outcomes across multiple domains.
Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
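To make the idea of a persistent tree that links hypotheses, evidence, and distilled insights more concrete, here is a minimal Python sketch. It is not Arbor's implementation. The class names `HypothesisNode` and `HypothesisTree`, the `status` values, and the rule that an insight is copied to every ancestor are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class HypothesisNode:
    """One hypothesis and the evidence gathered for it (hypothetical schema)."""
    node_id: int
    statement: str
    parent: Optional[HypothesisNode] = field(default=None, repr=False)
    children: list[HypothesisNode] = field(default_factory=list, repr=False)
    status: str = "open"            # assumed values: "open", "verified", "rejected"
    score: Optional[float] = None   # metric reported back by an executor
    insights: list[str] = field(default_factory=list)


class HypothesisTree:
    """Persistent research state: hypotheses, results, and lessons."""

    def __init__(self, root_statement: str) -> None:
        self._next_id = 0
        self.root = self._make(root_statement, None)

    def _make(self, statement: str, parent: Optional[HypothesisNode]) -> HypothesisNode:
        node = HypothesisNode(self._next_id, statement, parent)
        self._next_id += 1
        if parent is not None:
            parent.children.append(node)
        return node

    def add_child(self, parent: HypothesisNode, statement: str) -> HypothesisNode:
        return self._make(statement, parent)

    def record_result(self, node: HypothesisNode, score: float,
                      insight: str, improved: bool) -> None:
        """Store an executor's result and pass the lesson up to all ancestors."""
        node.score = score
        node.status = "verified" if improved else "rejected"
        current: Optional[HypothesisNode] = node
        while current is not None:
            current.insights.append(insight)
            current = current.parent

    def frontier(self) -> list[HypothesisNode]:
        """Return the open leaves, i.e. hypotheses that still await testing."""
        stack, open_leaves = [self.root], []
        while stack:
            node = stack.pop()
            if not node.children and node.status == "open":
                open_leaves.append(node)
            stack.extend(node.children)
        return open_leaves


# Toy usage with made-up hypotheses and a made-up score.
tree = HypothesisTree("Improve the held-out metric of the baseline")
lr = tree.add_child(tree.root, "Lower the learning rate")
aug = tree.add_child(tree.root, "Add data augmentation")
tree.record_result(lr, score=0.71,
                   insight="A smaller learning rate stabilizes training",
                   improved=True)
frontier_statements = [n.statement for n in tree.frontier()]
print(frontier_statements)  # ['Add data augmentation']
```

In this sketch the coordinator's role would correspond to choosing which `frontier()` node to expand next, and an executor's role to producing the arguments passed to `record_result`. How Arbor actually ranks the frontier or distills insights is not specified on this page.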
Community
Conversational AI has moved from single-turn chatbots to multi-turn dialogue systems and then to tool-using agents. We believe the next important stage is the rise of autonomous agents. However, many existing efforts are either tightly bound to specific scenarios and single tasks or remain research prototypes that cannot be deployed in practice. This raises a central question: what should a general and practical autonomous agent look like?
In our new work, Toward Generalist Autonomous Research via Hypothesis-Tree Refinement, we present our answer: Arbor. Automated research should not be reduced to repeated trial-and-error. Instead, it should explore in a structured way, organizing hypotheses, evidence, failures, and accumulated experience into an evolving research state, much like the process of real scientific inquiry. Each new attempt should build upon the discoveries and lessons from previous explorations.
Arbor first emphasizes generality. It is not tied to a particular benchmark or task format. Instead, it unifies diverse research tasks, including model training, harness engineering, and data synthesis, under the framework of Autonomous Optimization. As long as there is an artifact to optimize, a clear objective, and executable feedback signals, Arbor can conduct long-horizon search and iterative improvement around it.
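The three ingredients named above (an artifact, an objective, and an executable feedback signal) can be written down as a very small interface. The sketch below is only an illustration of that interface under assumed names (`autonomous_optimize`, `propose`, `evaluate`). It uses a plain greedy loop, not Arbor's tree-structured search, and the toy objective is invented.

```python
from __future__ import annotations

from typing import Callable, Iterable, TypeVar

Artifact = TypeVar("Artifact")


def autonomous_optimize(
    initial: Artifact,
    propose: Callable[[Artifact], Iterable[Artifact]],
    evaluate: Callable[[Artifact], float],
    max_rounds: int,
) -> tuple[Artifact, float]:
    """Greedy improvement loop over any artifact with a scalar objective.

    `propose` generates candidate edits of the current best artifact and
    `evaluate` is the executable feedback signal (higher is better).
    """
    best, best_score = initial, evaluate(initial)
    for _ in range(max_rounds):
        improved = False
        for candidate in propose(best):
            score = evaluate(candidate)
            if score > best_score:
                best, best_score = candidate, score
                improved = True
        if not improved:
            break  # no candidate beat the current best, so stop early
    return best, best_score


# Toy task: the "artifact" is an integer and the objective peaks at 7.
def neighbours(x: int) -> list[int]:
    return [x - 1, x + 1]


def objective(x: int) -> int:
    return -(x - 7) ** 2


result = autonomous_optimize(0, neighbours, objective, max_rounds=20)
print(result)  # (7, 0)
```

The same function signature would accept a training script, an evaluation harness, or a data-synthesis pipeline as the artifact, as long as `evaluate` can score it. That is the sense in which the setting is task-agnostic.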
Arbor also emphasizes practicality. It is not merely a paper idea or a research prototype confined to the lab: we have open-sourced a fully runnable CLI and an Agent Skill Suite. Users can run the complete Arbor CLI directly for long-horizon automated research experiments, or load Arbor-style skills into environments such as Codex and Claude Code, giving existing coding agents more structured autonomous research capabilities.
Arbor supports long-running experiments in real codebases, disciplined dev/test evaluation, git worktree isolation, checkpoint/resume, dashboard and report generation, and one-line plugin adaptation for different task types. Our goal is to move auto-research from a conceptual vision toward a truly usable system.
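One way to read "disciplined dev/test evaluation" is that candidates are compared on a dev split only, and the held-out test score is read once, for the candidate that was admitted. The sketch below illustrates that reading. The names `Candidate`, `select_on_dev`, and `relative_gain`, the relative-gain formula, and all scores are assumptions for illustration and do not come from the Arbor codebase or the paper's results.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    dev_score: float    # used for selection
    test_score: float   # held out; must not influence selection


def select_on_dev(baseline: Candidate, candidates: list[Candidate],
                  min_gain: float = 0.0) -> Candidate:
    """Admit the candidate with the best dev score, ignoring test scores."""
    best = baseline
    for candidate in candidates:
        if candidate.dev_score > best.dev_score + min_gain:
            best = candidate
    return best


def relative_gain(baseline_score: float, final_score: float) -> float:
    """Assumed definition: improvement as a fraction of the baseline score."""
    return (final_score - baseline_score) / abs(baseline_score)


# Made-up scores. Candidate "B" happens to score higher on test, but it is
# not chosen because selection only looks at dev.
baseline = Candidate("baseline", dev_score=0.60, test_score=0.58)
pool = [
    Candidate("A", dev_score=0.66, test_score=0.61),
    Candidate("B", dev_score=0.64, test_score=0.65),
]
chosen = select_on_dev(baseline, pool)
gain = relative_gain(baseline.test_score, chosen.test_score)
print(chosen.name, round(gain, 4))  # A 0.0517
```

Choosing "A" over "B" here is the point of the discipline: picking "B" because of its test score would leak held-out information into the search and inflate the reported gain.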
Interesting work in autonomous research!
Cool paper - I liked the way "Toward Generalist Autonomous Research via Hypothesis-Tree Refinement" frames the problem without making it feel too abstract.
Curious if you think this would still work once the setup gets messier in the wild?
I made a podcast on it with ResearchPod, which makes it easy to pick up the key concepts on the go:
https://researchpod.app/episode/5bcda69b-d4ea-445e-80d7-3a09392578fc
Paper: arxiv.org/abs/2606.11926