Hugging Face Daily Papers · 3 min read

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.30329

Published on May 28, 2026 · Submitted by Sy-Tuyen Ho on Jun 1, 2026
Authors: Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, Furong Huang
Organization: Furong Huang's Lab at UMD
Project page: https://hosytuyen.github.io/projects/SoundnessBench/
GitHub repository: https://github.com/hosytuyen/hosytuyen.github.io

Abstract

AI-generated summary: SoundnessBench evaluates large language models' ability to assess the methodological validity of machine learning research proposals, revealing persistent optimism bias in current models.

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.
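
The shift from false positives to false negatives described in the abstract can be illustrated with a minimal sketch. It is not the paper's evaluation code. It assumes soundness has been reduced to a binary label (True means "sound"), which is an assumption made here for illustration, and the `error_rates` helper and all toy data below are hypothetical, not results from SoundnessBench.

```python
from dataclasses import dataclass


@dataclass
class ErrorRates:
    false_positive_rate: float  # share of low-soundness proposals judged sound
    false_negative_rate: float  # share of sound proposals judged unsound


def error_rates(labels: list[bool], predictions: list[bool]) -> ErrorRates:
    """Compare binary judgments against binary labels (True means 'sound').

    Hypothetical helper for illustration; binarizing reviewer soundness
    scores is an assumption, not something the abstract specifies.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    return ErrorRates(
        false_positive_rate=fp / negatives if negatives else 0.0,
        false_negative_rate=fn / positives if positives else 0.0,
    )


# Toy data invented for illustration only (not from the benchmark).
labels = [True, True, True, True, False, False, False, False]
# An "optimistic" judge accepts almost everything.
standard = [True, True, True, True, True, True, True, False]
# A "harsh" judge rejects almost everything.
aggressive = [True, False, False, False, False, False, False, False]

std = error_rates(labels, standard)
agg = error_rates(labels, aggressive)
print(f"standard:   FPR={std.false_positive_rate:.2f} FNR={std.false_negative_rate:.2f}")
print(f"aggressive: FPR={agg.false_positive_rate:.2f} FNR={agg.false_negative_rate:.2f}")
```

In this toy setup the total number of errors is the same for both judges; only the error type changes. Reporting the two rates separately makes that kind of shift visible where a single accuracy figure would hide it.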

Community

Sy-Tuyen Ho (paper author, paper submitter), Jun 1, 2026:

SoundnessBench: Testing Whether LLMs Can Assess the Scientific Soundness of Research Plans


Get this paper in your agent:

hf papers read 2605.30329
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0 · Datasets citing this paper: 1 · Spaces citing this paper: 0 · Collections including this paper: 0
