Hugging Face Daily Papers · June 24, 2026 · 4 min read

NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

NatureBench is a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, spanning six scientific domains and designed to evaluate whether AI coding agents can move beyond reproduction toward discovery. Each task asks an agent to solve a real scientific machine-learning problem and is scored against the source paper's reported state of the art.

NatureBench is built on NatureGym, an automated pipeline that converts a published paper into a containerized task package comprising a task brief, the paper's dataset, a held-out test set with hidden ground truth, and an automated evaluator.

arXiv:2606.24530

Published on Jun 23 · Submitted by Kaiyan Zhang on Jun 24

#2 Paper of the day

Frontis AI

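Authors: Yuru Wang, Lejun Cheng, Yuxin Zuo, Sihang Zeng, Bingxiang He, Che Jiang, Junlin Yang, Yuchong Wang, Kaikai Zhao, Weifeng Huang, Kai Tian, Zhenzhao Yuan, Jincheng Zhong, Weizhi Wang, Ning Ding, Bowen Zhou, Kaiyan Zhang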

Abstract

NatureBench presents a cross-disciplinary benchmark of 90 scientific tasks derived from Nature publications to assess AI coding agents' ability to achieve discovery rather than just reproduction, revealing that current agents primarily rely on methodological translation rather than genuine scientific innovation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench
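As a rough illustration of the headline figure, the sketch below counts how many tasks clear the g>0.1 threshold. The abstract does not define g, so the function takes per-task g values as given input; the function name, the `threshold` parameter, and the sample values are all hypothetical. If all 90 tasks are counted, 17.8% corresponds to 16 tasks (16/90 ≈ 0.178).

```python
from typing import Sequence


def surpass_rate(g_values: Sequence[float], threshold: float = 0.1) -> float:
    """Fraction of tasks whose g strictly exceeds the threshold."""
    if not g_values:
        raise ValueError("need at least one task")
    wins = sum(1 for g in g_values if g > threshold)
    return wins / len(g_values)


# Made-up data for illustration only: 16 of 90 tasks clear the threshold.
example_g = [0.25] * 16 + [0.0] * 74
rate = surpass_rate(example_g)
print(f"{rate:.1%}")  # 17.8%
```

The comparison is strict (`>`), matching the way the criterion is written in the abstract; a task with g exactly equal to 0.1 would not count.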

Project page: https://frontisai.github.io/NatureBench/

Community

iseesaw

Paper submitter about 4 hours ago

NatureBench is built on NatureGym, an automated pipeline that converts a published paper into a containerized task package comprising a task brief, the paper's dataset, a held-out test set with hidden ground truth, and an automated evaluator.
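The four parts of the task package described above can be pictured as a small data structure. This is a minimal sketch and not NatureGym's actual format: the class name, field names, file paths, and the mean-absolute-error metric are all assumptions made for illustration. The point it shows is that the evaluator holds the hidden ground truth, so the agent only ever receives a score.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class TaskPackage:
    """Hypothetical container mirroring the four parts named in the text."""
    task_brief: str
    dataset_path: str      # the paper's dataset, visible to the agent
    test_inputs_path: str  # held-out test inputs, without labels
    # The evaluator closes over the hidden ground truth.
    evaluator: Callable[[Sequence[float]], float]


def make_mae_evaluator(hidden_truth: Sequence[float]) -> Callable[[Sequence[float]], float]:
    """Build an evaluator scoring by mean absolute error (an assumed metric)."""
    truth = list(hidden_truth)

    def evaluate(predictions: Sequence[float]) -> float:
        if len(predictions) != len(truth):
            raise ValueError("prediction count must match the test set size")
        return sum(abs(p - t) for p, t in zip(predictions, truth)) / len(truth)

    return evaluate


# Illustrative values only; none of these come from the benchmark.
task = TaskPackage(
    task_brief="Predict the target property for each held-out sample.",
    dataset_path="data/train.csv",
    test_inputs_path="data/test_inputs.csv",
    evaluator=make_mae_evaluator([1.0, 2.0, 3.0]),
)
score = task.evaluator([1.0, 2.5, 2.0])
print(score)
```

Keeping the labels inside the evaluator closure, and not as a field on the package, is one simple way to model "hidden ground truth": code that only holds a `TaskPackage` has no attribute from which to read the answers.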


Get this paper in your agent:

hf papers read 2606.24530

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
