Hugging Face Daily Papers · June 11, 2026 · 8 min read

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.11926


Published on Jun 10 · Submitted by Yuyang Hu on Jun 11

#2 Paper of the day

NLPIR Lab @ RUC


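Authors: Jiajie Jin, Yuyang Hu, Kai Qiu, Qi Dai, Chong Luo, Guanting Dong, Xiaoxi Li, Tong Zhao, Xiaolong Ma, Gongrui Zhang, Zhirong Wu, Bei Liu, Zhengyuan Yang, Linjie Li, Lijuan Wang, Hongjin Qian, Yutao Zhu, Zhicheng Dou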

Abstract

An AI framework called Arbor enables autonomous scientific research by combining strategic coordination, isolated hypothesis testing, and a persistent knowledge tree to iteratively improve research outcomes across multiple domains.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Scientific progress depends on a repeated loop of exploration, experimentation, and abstraction. Researchers test candidate directions, interpret the evidence, and carry the resulting lessons into later attempts. We study how an AI agent can run this loop autonomously over long horizons. We introduce Arbor, a general framework for autonomous research that combines a long-lived coordinator, short-lived executors, and Hypothesis Tree Refinement (HTR), a persistent tree that links hypotheses, artifacts, evidence, and distilled insights across time. The coordinator manages global research strategy over the tree, while executors implement and test individual hypotheses in isolated worktrees. As results return, Arbor updates the tree, propagates reusable lessons, refines the search frontier, and admits verified improvements. This design turns autonomous research from a sequence of local attempts into a cumulative process in which strategy, execution, and evidence are carried across time. We evaluate Arbor under Autonomous Optimization (AO), an operational setting where an agent improves an initial research artifact through iterative experimentation without step-level human supervision. Across six real research tasks in model training, harness engineering, and data synthesis, Arbor achieves the best held-out result on all six tasks, attaining more than 2.5x the average relative held-out gain of Codex and Claude Code under the same task interface and resource budget. On MLE-Bench Lite, Arbor reaches 86.36% Any Medal with GPT-5.5, the strongest result in our comparison.
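The abstract describes Hypothesis Tree Refinement only at a high level. As a rough illustration of the kind of data structure it implies, the sketch below keeps a hypothesis, an artifact reference, dev-set evidence, and distilled insights on each tree node, treats untested nodes as the search frontier, and admits a result only when it beats the incumbent. Everything here (`HypothesisNode`, `frontier`, `record_result`, the `reusable` flag that copies a lesson to the root, and the toy scores) is a hypothetical illustration, not Arbor's actual schema or API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HypothesisNode:
    """Hypothetical tree node linking a hypothesis, its artifact, evidence, and insights."""
    node_id: str
    hypothesis: str
    parent: Optional[HypothesisNode] = field(default=None, repr=False, compare=False)
    children: list[HypothesisNode] = field(default_factory=list, repr=False, compare=False)
    artifact_ref: Optional[str] = None  # e.g. the branch of an isolated git worktree
    dev_score: Optional[float] = None   # evidence reported back by an executor
    insights: list[str] = field(default_factory=list)
    status: str = "open"                # "open", "admitted", or "rejected"

    def add_child(self, node_id: str, hypothesis: str) -> HypothesisNode:
        child = HypothesisNode(node_id, hypothesis, parent=self)
        self.children.append(child)
        return child

    def root(self) -> HypothesisNode:
        node = self
        while node.parent is not None:
            node = node.parent
        return node

    def inherited_insights(self) -> list[str]:
        """Insights visible here: ancestors' lessons (root first), then this node's own."""
        chain: list[list[str]] = []
        node: Optional[HypothesisNode] = self
        while node is not None:
            chain.append(node.insights)
            node = node.parent
        return [lesson for lessons in reversed(chain) for lesson in lessons]


def frontier(root: HypothesisNode) -> list[HypothesisNode]:
    """All untested ("open") nodes in the tree, in depth-first order."""
    found: list[HypothesisNode] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.status == "open":
            found.append(node)
        stack.extend(reversed(node.children))
    return found


def record_result(node: HypothesisNode, dev_score: float, insight: str,
                  best_score: float, reusable: bool = False) -> float:
    """Attach executor evidence, admit the node only if it beats the incumbent,
    and return the (possibly updated) best dev score."""
    node.dev_score = dev_score
    # A reusable lesson is stored on the root so every node in the tree inherits it.
    (node.root() if reusable else node).insights.append(insight)
    if dev_score > best_score:
        node.status = "admitted"
        return dev_score
    node.status = "rejected"
    return best_score


root = HypothesisNode("root", "baseline artifact", dev_score=0.70, status="admitted")
h1 = root.add_child("h1", "increase warmup steps")
h2 = root.add_child("h2", "filter noisy training examples")
best = record_result(h1, 0.68, "longer warmup did not help", root.dev_score, reusable=True)
best = record_result(h2, 0.74, "filtering noisy examples helps", best)
h3 = h2.add_child("h3", "filter more aggressively")

print(best)                                 # 0.74
print([n.node_id for n in frontier(root)])  # ['h3']
print(h3.inherited_insights())  # ['longer warmup did not help', 'filtering noisy examples helps']
```

Storing lessons on ancestors means a new child hypothesis starts with everything learned above it, including a failed sibling's reusable lesson; whether Arbor propagates insights in this particular way is not stated in the source.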

arXiv: arxiv.org/abs/2606.11926 · Project page: https://ruc-nlpir.github.io/Arbor/ · GitHub: https://github.com/RUC-NLPIR/Arbor (63 stars)

Community

namespace-ERI · Paper author · Paper submitter · about 17 hours ago

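From single-turn chatbots to multi-turn dialogue systems to tool-using agents, we believe the next important stage is the rise of autonomous agents. However, many existing efforts are either tightly bound to specific scenarios and single tasks, or remain research prototypes that are not truly deployable in practice. This raises a central question: what should a general and practical autonomous agent look like?

In our new work, Toward Generalist Autonomous Research via Hypothesis-Tree Refinement, we present our answer: Arbor. Automated research should not be reduced to repeated trial-and-error. Instead, it should explore in a structured way, organizing hypotheses, evidence, failures, and accumulated experience into an evolving research state, much like real scientific inquiry. Each new attempt should build on the discoveries and lessons of previous explorations.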

First, Arbor emphasizes generality. It is not tied to a particular benchmark or task format; instead, it unifies diverse research tasks, including model training, harness engineering, and data synthesis, under a single framework, Autonomous Optimization (AO). As long as there is an artifact to optimize, a clear objective, and an executable feedback signal, Arbor can run long-horizon search and iterative improvement around it.
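To make the three ingredients of Autonomous Optimization concrete (an artifact, an objective, and an executable feedback signal), here is a minimal sketch of such a task specification together with a deliberately naive greedy search loop. `AOTask`, `optimize`, the warmup-steps config, and the toy scoring function are all hypothetical; in a real task `evaluate` would run training or a benchmark, and Arbor's own search is tree-structured rather than greedy.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A = TypeVar("A")


@dataclass
class AOTask(Generic[A]):
    """Hypothetical AO task: an artifact, an objective, and an executable feedback signal."""
    artifact: A                     # the thing being optimized (code, config, dataset, ...)
    objective: str                  # human-readable statement of the goal
    evaluate: Callable[[A], float]  # executable feedback; higher is better


def optimize(task: AOTask[A], propose: Callable[[A, int], A], steps: int) -> tuple[A, float]:
    """Naive greedy loop: keep a proposal only if its feedback score improves."""
    best, best_score = task.artifact, task.evaluate(task.artifact)
    for step in range(steps):
        candidate = propose(best, step)
        score = task.evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy usage: the "artifact" is a config dict and the feedback peaks at 300 warmup steps.
task = AOTask(
    artifact={"warmup_steps": 0},
    objective="maximize the toy dev score",
    evaluate=lambda cfg: -abs(cfg["warmup_steps"] - 300),
)
best, score = optimize(
    task,
    lambda cfg, step: {**cfg, "warmup_steps": cfg["warmup_steps"] + 100},
    steps=5,
)
print(best, score)  # {'warmup_steps': 300} 0
```

Because the interface only asks for something to change and a score to read back, the same loop shape applies whether the artifact is training code, an evaluation harness, or a synthetic dataset.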

Arbor also emphasizes practicality. It is not merely a paper idea or a research prototype confined to the lab: we open-source a fully runnable CLI and an Agent Skill Suite. Users can run the complete Arbor CLI directly for long-horizon automated research experiments, or load Arbor-style skills into environments such as Codex and Claude Code, giving existing coding agents more structured autonomous research capabilities.

Arbor supports long-running experiments in real codebases, disciplined dev/test evaluation, git worktree isolation, checkpoint/resume, dashboard and report generation, and one-line plugin adaptation for different task types. Our goal is to move auto-research from a conceptual vision toward a truly usable system.
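The post lists disciplined dev/test evaluation and checkpoint/resume without detailing them. One common way to get both, sketched below under that assumption, is to let only dev scores drive admission, persist the research state to disk after every completed iteration so an interrupted run can continue where it stopped, and touch the held-out test split once at the end. The state schema, the file name `research_state.json`, and the functions `save_checkpoint`, `load_or_init`, `run`, and `final_report` are hypothetical and do not reflect Arbor's implementation.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def save_checkpoint(state: dict, path: Path) -> None:
    """Write the state to a temp file, then swap it in with a single rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX when both paths share a filesystem


def load_or_init(path: Path) -> dict:
    """Resume from an existing checkpoint, or start a fresh research state."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"iteration": 0, "best_dev": None, "best_iter": None, "history": []}


def run(path: Path, experiment: Callable[[int], float], max_iters: int) -> dict:
    """Iterate until max_iters; only dev scores decide which candidate is kept."""
    state = load_or_init(path)
    while state["iteration"] < max_iters:
        i = state["iteration"]
        dev = experiment(i)  # stand-in for an executor's dev-set evaluation
        if state["best_dev"] is None or dev > state["best_dev"]:
            state["best_dev"], state["best_iter"] = dev, i
        state["history"].append({"iter": i, "dev": dev})
        state["iteration"] = i + 1
        save_checkpoint(state, path)  # persist after every completed iteration
    return state


def final_report(state: dict, test_fn: Callable[[int], float]) -> dict:
    """Evaluate the held-out test split once, for the selected candidate only."""
    return {"best_iter": state["best_iter"], "dev": state["best_dev"],
            "test": test_fn(state["best_iter"])}


dev_by_iter = [0.61, 0.66, 0.64, 0.70]
ckpt = Path(tempfile.mkdtemp()) / "research_state.json"
run(ckpt, lambda i: dev_by_iter[i], max_iters=2)          # simulate a run stopped after 2 iterations
state = run(ckpt, lambda i: dev_by_iter[i], max_iters=4)  # resumes at iteration 2
print(state["iteration"], state["best_iter"], state["best_dev"])  # 4 3 0.7
```

Keeping the test split out of the loop prevents the search from overfitting to it, which matters in long runs where many candidates are compared.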

dongguanting · Paper author · about 17 hours ago

Interesting work in autonomous research!

noahml · about 8 hours ago

Cool paper - I liked the way "Toward Generalist Autonomous Research via Hypothesis-Tree Refinement" frames the problem without making it feel too abstract.

Curious if you think this would still work once the setup gets messier in the wild?

I made a podcast on it with ResearchPod; it makes it easy to get the key concepts on the go:
https://researchpod.app/episode/5bcda69b-d4ea-445e-80d7-3a09392578fc


Get this paper in your agent:

hf papers read 2606.11926

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0

Datasets citing this paper: 0

Spaces citing this paper: 0

Collections including this paper: 1
