Hugging Face Daily Papers · June 23, 2026 · 5 min read

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2606.18648


Published on Jun 17 · Submitted by yigengjiang on Jun 23

shanghai ailab


Authors:

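Yigeng Jiang, Tengchao Yang, Taoyong Cui, Jiaxing Wan, Yuan Wang, Weida Wang, Zhiyu Liu, Chuyi Peng, Binzhao Luo, Maoli Gao, Huaihai Huang, Yuqianer Zeng, Ziyang Zheng, Dongchen Huang, Chao Chen, Zichao Liu, Weiping Shen, Shuchen Pu, Siyu Zhou, Runmin Ma, Yusong Hu, Fei Chao, Bo Zhang, Xiawu Zheng, Zifu Wang, Lei Bai, Yunqi Cai, Shufei Zhang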

Abstract

The PhySciBench benchmark reveals limited performance of current LLM agents in physical science research, motivating the DelveAgent framework, which improves accuracy through modular design and physics-grounded mechanisms.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Deep research agents are Large Language Model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold immense potential for accelerating research in the physical sciences. However, comprehensive and in-depth evaluations of their capabilities within this domain remain lacking. To address this gap, we introduce PhySciBench, a benchmark highly relevant to physical science research, comprising 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance; even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference costs to approximately one-third of the strongest baseline. These results establish the significance of PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research.
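To make the three components named in the abstract more concrete, here is a minimal toy sketch of how a planning loop, a two-level memory, and a two-level physics check could fit together. It is not DelveAgent's implementation: every name (`Quantity`, `Memory`, `run_plan`, `reflect_step`, `reflect_final`) is hypothetical, "adaptive planning" is reduced to falling back to the next candidate solver, "dual-granularity memory" is read as raw step logs plus short task-level notes, and "physics-grounded reflection" is reduced to a unit check and a sign check. All of these readings are assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Quantity:
    value: float
    unit: str


@dataclass
class Memory:
    # Assumed reading of "dual granularity": raw per-attempt logs plus short notes.
    steps: List[str] = field(default_factory=list)  # fine-grained
    notes: List[str] = field(default_factory=list)  # coarse-grained


# A step is (name, expected unit, candidate solvers tried in order).
Step = Tuple[str, str, List[Callable[[], Quantity]]]


def reflect_step(q: Quantity, expected_unit: str) -> bool:
    """Step-level check (toy): the result must carry the expected unit."""
    return q.unit == expected_unit


def reflect_final(q: Quantity) -> bool:
    """Task-level check (toy): the final value must be non-negative."""
    return q.value >= 0


def run_plan(plan: List[Step], memory: Memory) -> Quantity:
    result: Optional[Quantity] = None
    for name, expected_unit, solvers in plan:
        for attempt, solve in enumerate(solvers):
            candidate = solve()
            memory.steps.append(f"{name}#{attempt}: {candidate.value} {candidate.unit}")
            if reflect_step(candidate, expected_unit):
                result = candidate
                memory.notes.append(f"{name} ok after {attempt + 1} attempt(s)")
                break  # move on to the next step
        else:
            raise RuntimeError(f"no solver passed reflection for step {name!r}")
    if result is None or not reflect_final(result):
        raise RuntimeError("final answer failed task-level reflection")
    return result


def demo() -> Tuple[Quantity, Memory]:
    m, v = 2.0, 3.0  # mass in kg, speed in m/s (made-up inputs)
    plan: List[Step] = [
        ("kinetic_energy", "J", [
            lambda: Quantity(0.5 * m * v, "kg*m/s"),  # flawed: forgot to square v
            lambda: Quantity(0.5 * m * v ** 2, "J"),  # corrected attempt
        ]),
    ]
    memory = Memory()
    return run_plan(plan, memory), memory


answer, memory = demo()
print(answer, memory.notes)
```

In this toy run the first attempt is rejected by the step-level unit check, the second passes, and both attempts stay in the fine-grained log while only the outcome is kept as a note.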


Community

yigengx

Paper author Paper submitter about 23 hours ago

Excited to share our paper: Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark.

We introduce PhySciBench, a benchmark for evaluating deep-research agents in the physical sciences, and DelveAgent, a modular multi-agent framework for more reliable autonomous scientific reasoning.

PhySciBench contains 200 expert-curated questions across physics and chemistry, covering multimodal QA, long-context QA, structured information extraction, scientific reasoning, experimental design, and code generation. We find that current frontier systems still struggle: the strongest baseline, Gemini Deep Research, reaches 33.5% accuracy.
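As a rough check on the headline number: if every question is scored right or wrong with equal weight (an assumption; the scoring rule is not spelled out here), 33.5% of 200 questions corresponds to 67 correct answers. The sketch below shows that arithmetic plus a per-category breakdown. The function names are hypothetical and the `records` list is made-up illustrative data, not benchmark results.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, bool]  # (task category, answered correctly?)


def overall_accuracy(records: List[Record]) -> float:
    """Fraction of correct answers, assuming equal weight per question."""
    return sum(int(ok) for _, ok in records) / len(records)


def accuracy_by_category(records: List[Record]) -> Dict[str, float]:
    """Per-category accuracy under the same binary-scoring assumption."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for category, ok in records:
        totals[category] += 1
        correct[category] += int(ok)
    return {c: correct[c] / totals[c] for c in totals}


# 33.5% of 200 questions, under binary scoring.
implied_correct = round(0.335 * 200)

# Made-up records for illustration only.
records: List[Record] = [
    ("code generation", True),
    ("code generation", False),
    ("experimental design", False),
    ("scientific reasoning", True),
]
print(implied_correct, overall_accuracy(records), accuracy_by_category(records))
```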

Based on this failure analysis, DelveAgent adds adaptive planning, dual-granularity memory, and physics-grounded hierarchical reflection, improving accuracy by up to 7.5 percentage points while reducing inference cost to around one-third of the strongest baseline.
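A gain stated in percentage points is an absolute difference between two accuracies, which is not the same as a relative percent improvement. The sketch below illustrates the distinction, along with a cost ratio. The inputs (40.0, 47.5, and the cost values) are hypothetical placeholders chosen for easy arithmetic; they are not figures from the paper, and it is not stated here which benchmark or baseline the 7.5-point figure refers to.

```python
def percentage_point_gain(baseline_acc: float, new_acc: float) -> float:
    """Absolute difference between two accuracies given in percent."""
    return new_acc - baseline_acc


def relative_gain(baseline_acc: float, new_acc: float) -> float:
    """Improvement as a fraction of the baseline accuracy."""
    return (new_acc - baseline_acc) / baseline_acc


def cost_ratio(baseline_cost: float, new_cost: float) -> float:
    """New inference cost as a fraction of the baseline cost."""
    return new_cost / baseline_cost


# Hypothetical inputs, for illustration only.
pp = percentage_point_gain(40.0, 47.5)  # absolute gain in percentage points
rel = relative_gain(40.0, 47.5)         # same gain relative to the baseline
ratio = cost_ratio(3.0, 1.0)            # "around one-third" of baseline cost
print(pp, rel, ratio)
```

With these placeholder inputs, a 7.5-point gain over a 40% baseline is an 18.75% relative improvement; the same 7.5 points over a lower baseline would be a larger relative gain.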

Dataset: https://huggingface.co/datasets/yigengx/PhySciBench
Code: https://github.com/yigengjiang/physci-deepresearch