Hugging Face Daily Papers · 5 min read

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.18648

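Published on Jun 17 · Submitted by yigengjiang on Jun 23
Authors: Yigeng Jiang, Tengchao Yang, Taoyong Cui, Jiaxing Wan, Yuan Wang, Weida Wang, Zhiyu Liu, Chuyi Peng, Binzhao Luo, Maoli Gao, Huaihai Huang, Yuqianer Zeng, Ziyang Zheng, Dongchen Huang, Chao Chen, Zichao Liu, Weiping Shen, Shuchen Pu, Siyu Zhou, Runmin Ma, Yusong Hu, Fei Chao, Bo Zhang, Xiawu Zheng, Zifu Wang, Lei Bai, Yunqi Cai, Shufei Zhang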

Abstract

The PhySciBench benchmark reveals the limited performance of current LLM agents in physical science research, motivating the DelveAgent framework, which improves accuracy through modular design and physics-grounded mechanisms.

Deep research agents are Large Language Model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold immense potential for accelerating research in the physical sciences. However, comprehensive and in-depth evaluations of their capabilities within this domain remain lacking. To address this gap, we introduce PhySciBench, a benchmark highly relevant to physical science research, comprising 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance; even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference costs to approximately one-third of the strongest baseline. These results establish the significance of PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research.
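To make the headline numbers concrete, the short sketch below works through the arithmetic. It assumes accuracy is the fraction of the 200 questions answered correctly with every question weighted equally, which the abstract does not state explicitly. The 7.5-point gain is reported as an upper bound across four benchmarks, so applying it to the 33.5% baseline is purely hypothetical. All function and variable names are illustrative.

```python
TOTAL_QUESTIONS = 200      # benchmark size stated in the abstract
BASELINE_ACCURACY = 0.335  # Gemini Deep Research, as reported
MAX_GAIN_POINTS = 7.5      # "up to" gain, in percentage points


def correct_count(accuracy: float, total: int) -> int:
    """Correct answers implied by an accuracy (assumes equal weighting)."""
    return round(accuracy * total)


def add_points(accuracy: float, points: float) -> float:
    """Add a percentage-point gain to an accuracy expressed as a fraction."""
    return accuracy + points / 100.0


baseline_correct = correct_count(BASELINE_ACCURACY, TOTAL_QUESTIONS)

# Hypothetical: accuracy if the full 7.5-point gain applied to this baseline.
hypothetical_accuracy = add_points(BASELINE_ACCURACY, MAX_GAIN_POINTS)

# The same gain expressed as a relative (percent) improvement over 33.5%.
relative_gain = (MAX_GAIN_POINTS / 100.0) / BASELINE_ACCURACY

print(baseline_correct)
print(f"{hypothetical_accuracy:.3f}")
print(f"{relative_gain:.1%}")
```

Under these assumptions, 33.5% corresponds to 67 of 200 questions, and a 7.5-point gain on that base would be a relative improvement of roughly 22%, which is why percentage points and percent should not be used interchangeably.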

Community

Paper author · Paper submitter · about 23 hours ago

Excited to share our paper: Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark.

We introduce PhySciBench, a benchmark for evaluating deep-research agents in the physical sciences, and DelveAgent, a modular multi-agent framework for more reliable autonomous scientific reasoning.

PhySciBench contains 200 expert-curated questions across physics and chemistry, covering multimodal QA, long-context QA, structured information extraction, scientific reasoning, experimental design, and code generation. We find that current frontier systems still struggle: the strongest baseline, Gemini Deep Research, reaches 33.5% accuracy.

Based on this failure analysis, DelveAgent adds adaptive planning, dual-granularity memory, and physics-grounded hierarchical reflection, improving accuracy by up to 7.5 percentage points while reducing inference cost to around one-third of the strongest baseline.

Dataset: https://huggingface.co/datasets/yigengx/PhySciBench
Code: https://github.com/yigengjiang/physci-deepresearch
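
The comment above names three components: adaptive planning, dual-granularity memory, and physics-grounded reflection. The toy sketch below shows one way such pieces could fit together in a single loop. It is not DelveAgent's implementation (the linked repository is the authoritative source), and every name in it (`Memory`, `run_agent`, `execute`, `check`, `replan`) is hypothetical. The reflection step here is a single flat sanity check and does not model the hierarchy the paper describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Memory:
    """Two granularities (an assumption): raw step records and short notes."""
    fine: List[Tuple[str, float]] = field(default_factory=list)  # (step, value)
    coarse: List[str] = field(default_factory=list)              # one-line notes

    def record(self, step: str, value: float) -> None:
        self.fine.append((step, value))
        self.coarse.append(f"{step} -> {value:g}")


def run_agent(
    plan: List[str],
    execute: Callable[[str, Memory], float],
    check: Callable[[str, float], bool],
    replan: Callable[[str], str],
    max_retries: int = 2,
) -> Memory:
    """Run each step, verify the result, and revise the step if the check fails."""
    memory = Memory()
    for step in plan:
        for _ in range(max_retries + 1):
            value = execute(step, memory)
            if check(step, value):   # reflection: domain sanity check
                memory.record(step, value)
                break
            step = replan(step)      # adaptive planning: revise the failed step
        else:
            raise RuntimeError(f"step failed verification: {step}")
    return memory


# Toy usage: a speed magnitude, then the kinetic energy computed from it.
def execute(step: str, memory: Memory) -> float:
    if step == "speed_raw":
        return -3.0                      # deliberately unphysical magnitude
    if step == "speed_abs":
        return 3.0
    if step == "kinetic_energy":
        speed = memory.fine[-1][1]       # reuse the previous verified result
        return 0.5 * 2.0 * speed ** 2    # a mass of 2.0 kg is assumed
    raise ValueError(step)


def check(step: str, value: float) -> bool:
    return value >= 0.0                  # magnitudes and kinetic energy are non-negative


def replan(step: str) -> str:
    return "speed_abs" if step == "speed_raw" else step


memory = run_agent(["speed_raw", "kinetic_energy"], execute, check, replan)
print(memory.coarse)
```

In this toy run the first step fails the non-negativity check, is revised once, and the verified value is then carried through memory into the second step. A real system would replace the hard-coded functions with model calls and richer checks such as unit or conservation-law tests.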

Get this paper in your agent:

hf papers read 2606.18648
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
