Hugging Face Daily Papers · June 2, 2026 · 5 min read

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2605.28897

Published on May 27, 2026 · Submitted by Jan Strich on Jun 2, 2026

Hub of Computing and Data Science (HCDS) - G4KMU

Authors:

Hans Ole Hatzel, Sebastian Steindl, Jan Strich

Abstract

Empirical analysis reveals limited alignment between LLM-generated reviews and human reviews, with varying performance across different prompts and models, and demonstrates that authors can strategically improve paper scores through iterative revision based on LLM feedback.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

LLM-generated reviews for scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume not only that reviewers use LLM assistance, but also that authors use LLMs to revise their papers before submitting. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we identify a limited alignment of LLM reviews with human ones. In the best-case scenario, the alignment is reasonable. However, we also find that LLM-human alignment varies substantially across prompts and models. Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase in overall scores for up to 35% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.
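The alignment and stability claims in the abstract can be made concrete with a small sketch. This is not the authors' evaluation code, and this page does not say which agreement metric the paper uses; Spearman rank correlation is assumed here as one common choice. The function names (`ranks`, `spearman`, `run_spread`) and the sample scores are hypothetical.

```python
from statistics import mean, pstdev


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def run_spread(runs):
    """Mean per-paper standard deviation, where runs[r][p] is paper p's score in run r."""
    return mean(pstdev(scores) for scores in zip(*runs))


if __name__ == "__main__":
    # Made-up scores for four papers, for illustration only.
    human = [2.0, 3.0, 3.5, 4.0]
    llm_runs = [[3.0, 3.0, 4.0, 3.5], [3.5, 3.0, 3.5, 4.0]]
    for run in llm_runs:
        print("agreement with humans:", round(spearman(human, run), 3))
    print("spread across repeated runs:", round(run_spread(llm_runs), 3))
```

Reporting the correlation per run alongside the spread across runs keeps the two questions separate: whether the LLM ranks papers like humans do, and whether it ranks them the same way twice.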

Community

strich · Paper author · Paper submitter · about 13 hours ago

As submission numbers continue to rise (NeurIPS 2026: 40k+; ARR May 2026: 17k+), automated reviewing is becoming increasingly difficult to ignore. This year, NeurIPS, EMNLP, and AAAI are testing automated review pipelines, while new papers from Stanford and Mila claim to present safe and stable setups.

The question of whether we can trust these pipelines is therefore becoming increasingly important.
Do the generated reviews really align with human judgment, and are the reviews “safe”?
Or can they be gamed to artificially inflate scores without any meaningful change to the content?

In our new paper, “Review Arcade: On the Human Alignment and Gameability of LLM Reviews,” we examine 1k real ACL 2025 submissions, together with their real scores and reviews, to test whether LLM reviews align with the human ones.

Our experiments yield three key findings:

1. Across five model families, we find only limited agreement with human evaluations, with differences between accepted and rejected submissions.
2. Even when agreement is present, the results are not stable across models, prompts, or even repetitions of the same evaluation, making reliability highly problematic.
3. We find a way to “game” the models: an iterative process over 10 iterations increases LLM review scores (for up to ~35% of submissions) without making meaningful changes (see the figure in the original post).
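The iterative process in the third finding can be sketched as a simple review-and-revise loop. This is a toy illustration under stated assumptions, not the authors' pipeline (their code is at https://github.com/uhh-hcds/reviewarcade): `iterate_revisions`, `toy_review`, and `toy_revise` are hypothetical names, and the toy reviewer is a deliberately naive stand-in for an LLM call that rewards a buzzword rather than substance.

```python
from typing import Callable, List, Tuple

# A reviewer maps a draft to (overall score, feedback text).
Reviewer = Callable[[str], Tuple[float, str]]
# A reviser maps (draft, feedback) to a new draft.
Reviser = Callable[[str, str], str]


def iterate_revisions(
    draft: str, review: Reviewer, revise: Reviser, max_iters: int = 10
) -> Tuple[str, List[float]]:
    """Repeatedly revise a draft against reviewer feedback.

    Returns the best-scoring draft and the score after each step,
    starting with the score of the original draft.
    """
    score, feedback = review(draft)
    history = [score]
    best_draft, best_score = draft, score
    for _ in range(max_iters):
        draft = revise(draft, feedback)
        score, feedback = review(draft)
        history.append(score)
        if score > best_score:
            best_draft, best_score = draft, score
    return best_draft, history


def toy_review(text: str) -> Tuple[float, str]:
    """Hypothetical reviewer: scores 2.0 plus 0.5 per 'novel', capped at 5.0."""
    score = min(5.0, 2.0 + 0.5 * text.lower().count("novel"))
    return score, "Emphasize the novelty of the contribution."


def toy_revise(text: str, feedback: str) -> str:
    """Hypothetical reviser: appends a buzzword, changing nothing of substance."""
    return text + " novel"


if __name__ == "__main__":
    final_draft, scores = iterate_revisions("We study X.", toy_review, toy_revise)
    print(scores)
```

In the toy setup the score climbs purely because the reviser echoes what the reviewer rewards, which is the failure mode the finding describes; a real experiment would replace both stand-ins with model calls and check whether the revisions change the paper's content.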

Therefore, we strongly advise against relying on reviews generated by LLMs alone and encourage a discussion about whether this should be the solution to the enormous number of submissions.

Full Details: https://arxiv.org/pdf/2605.28897

Get this paper in your agent:

hf papers read 2605.28897

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
