Hugging Face Daily Papers · May 21, 2026 · 7 min read

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall short, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. 
Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.
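The composite score mentioned in the abstract can be illustrated with a minimal sketch. It assumes, without confirmation from the paper, that a criticism counts toward the composite only when experts rate it correct, significant, and sufficiently evidenced all at once. The `Rating` schema, the `composite_rate` function, and the toy data are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """Expert rating of one criticism (hypothetical schema, not the paper's)."""
    correct: bool
    significant: bool
    well_evidenced: bool


def composite_rate(ratings: list[Rating]) -> float:
    """Fraction of criticisms that pass all three dimensions.

    Assumed definition of the composite; the paper may define it differently.
    """
    if not ratings:
        return 0.0
    passed = sum(
        r.correct and r.significant and r.well_evidenced for r in ratings
    )
    return passed / len(ratings)


# Toy ratings for illustration only; these are not data from the study.
toy = [
    Rating(True, True, True),
    Rating(True, False, True),
    Rating(False, False, False),
    Rating(True, True, True),
]
print(composite_rate(toy))  # 0.5
```

Under this reading, a reviewer's composite percentage would be the share of its criticisms that experts accept on every dimension, so a single weak dimension is enough to exclude a criticism.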


arxiv:2605.20668


Published on May 20

Submitted by Seungone Kim on May 21 · Carnegie Mellon University


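Authors: Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R. Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther H. R. Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig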

AI-generated summary

AI reviewers demonstrate superior performance in identifying correct criticisms compared to human reviewers, yet exhibit limitations in subfield knowledge and context management that distinguish them from human peers.

Project page: https://prometheus-eval.github.io/cmu-paper-reviewer/ · GitHub: https://github.com/prometheus-eval/cmu-paper-reviewer


Get this paper in your agent:

hf papers read 2605.20668

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
