Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark
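Authors: Yigeng Jiang, Tengchao Yang, Taoyong Cui, Jiaxing Wan, Yuan Wang, Weida Wang, Zhiyu Liu, Chuyi Peng, Binzhao Luo, Maoli Gao, Huaihai Huang, Yuqianer Zeng, Ziyang Zheng, Dongchen Huang, Chao Chen, Zichao Liu, Weiping Shen, Shuchen Pu, Siyu Zhou, Runmin Ma, Yusong Hu, Fei Chao, Bo Zhang, Xiawu Zheng, Zifu Wang, Lei Bai, Yunqi Cai, Shufei Zhang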
Abstract
The PhySciBench benchmark reveals the limited performance of current LLM agents in physical science research, motivating the development of the DelveAgent framework, which improves accuracy through modular design and physics-grounded mechanisms.
Deep research agents are Large Language Model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold immense potential for accelerating research in the physical sciences. However, comprehensive and in-depth evaluations of their capabilities within this domain remain lacking. To address this gap, we introduce PhySciBench, a benchmark highly relevant to physical science research, comprising 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance; even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference costs to approximately one-third of the strongest baseline. These results establish the significance of PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research.
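The abstract names three DelveAgent components without giving implementation details. The sketch below is only one possible reading of how an adaptive planning loop, a two-level memory, and a physics-grounded check could fit together. Every name in it (`DualMemory`, `run_agent`, the `toy_*` helpers) is hypothetical and not taken from the paper or its repository, and the hierarchical reflection mechanism is collapsed here into a single sanity check (kinetic energy must be non-negative).

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DualMemory:
    """Hypothetical two-level memory: all step notes plus verified findings."""
    step_notes: List[str] = field(default_factory=list)    # fine-grained log
    task_summary: List[str] = field(default_factory=list)  # coarse-grained summary

    def record(self, note: str, verified: bool) -> None:
        self.step_notes.append(note)        # every attempt is kept
        if verified:
            self.task_summary.append(note)  # only verified results are promoted


def run_agent(
    plan: List[str],
    execute: Callable[[str, DualMemory], float],
    verify: Callable[[str, float], bool],
    revise: Callable[[str], str],
    max_revisions: int = 2,
) -> DualMemory:
    """Run each planned step, check it, and revise the step if the check fails."""
    memory = DualMemory()
    for step in plan:
        for _attempt in range(max_revisions + 1):
            result = execute(step, memory)
            ok = verify(step, result)
            memory.record(f"{step} -> {result}", verified=ok)
            if ok:
                break
            step = revise(step)  # adaptive planning: rewrite the failed step
    return memory


# Toy stand-ins (assumptions for illustration only).
def toy_execute(step: str, memory: DualMemory) -> float:
    mass, speed = 2.0, 3.0
    if step == "kinetic energy (naive)":
        return -0.5 * mass * speed      # deliberately wrong: negative energy
    return 0.5 * mass * speed ** 2      # E = m v^2 / 2


def toy_verify(step: str, result: float) -> bool:
    return result >= 0.0                # kinetic energy cannot be negative


def toy_revise(step: str) -> str:
    return "kinetic energy (revised)"


memory = run_agent(["kinetic energy (naive)"], toy_execute, toy_verify, toy_revise)
print(memory.step_notes)
print(memory.task_summary)
```

In this toy run the first attempt fails the check and stays only in the fine-grained log, while the revised step passes and is promoted to the summary. A real system would replace the toy callables with LLM calls and domain-specific verifiers.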
Community
Excited to share our paper: Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark.
We introduce PhySciBench, a benchmark for evaluating deep-research agents in the physical sciences, and DelveAgent, a modular multi-agent framework for more reliable autonomous scientific reasoning.
PhySciBench contains 200 expert-curated questions across physics and chemistry, covering multimodal QA, long-context QA, structured information extraction, scientific reasoning, experimental design, and code generation. We find that current frontier systems still struggle: the strongest baseline, Gemini Deep Research, reaches 33.5% accuracy.
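To make the headline number concrete, here is a minimal scoring sketch. It assumes that accuracy is the plain fraction of questions answered correctly, with no partial credit. The page does not state the scoring rule, so this is an assumption, as are the function names and the toy records. Under that assumption, 33.5% of 200 questions corresponds to 67 correct answers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, bool]  # (task category, answered correctly?)


def overall_accuracy(records: Iterable[Record]) -> float:
    """Fraction of records marked correct (binary scoring assumed)."""
    records = list(records)
    return sum(is_correct for _, is_correct in records) / len(records)


def accuracy_by_category(records: Iterable[Record]) -> Dict[str, float]:
    """Per-category accuracy under the same binary-scoring assumption."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


# Reported figure: 33.5% accuracy on 200 questions.
n_questions = 200
n_correct = round(0.335 * n_questions)
print(n_correct)

# Toy records (made up for illustration, not benchmark results).
toy_records: List[Record] = [
    ("scientific reasoning", True),
    ("scientific reasoning", False),
    ("code generation", True),
    ("code generation", True),
]
toy_overall = overall_accuracy(toy_records)
toy_per_category = accuracy_by_category(toy_records)
print(toy_overall, toy_per_category)
```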
Motivated by an analysis of these failures, DelveAgent adds adaptive planning, dual-granularity memory, and physics-grounded hierarchical reflection. Across four scientific benchmarks, it improves accuracy by up to 7.5 percentage points while reducing inference cost to around one-third of that of the strongest baseline.
Dataset: https://huggingface.co/datasets/yigengx/PhySciBench
Code: https://github.com/yigengjiang/physci-deepresearch
Paper: arxiv.org/abs/2606.18648