Hugging Face Daily Papers · 5 min read

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.01286

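Published on May 31 · Submitted by yangzhen on Jun 4
Authors: Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao, Ziheng Zhou, Mert Cemri, Shu Liu, Yuran Xiu, Chenxiao Yan, Haikun Zhao, Bin Yu, Ion Stoica, Dawn Song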

Abstract

BenchEvolver is an evolutionary framework that automatically generates harder coding problems from existing ones, creating challenging benchmarks that maintain validity and diversity while enabling model self-improvement and enhanced training performance.

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved and difficult original LCB-v6 tasks, where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves +8.7 and +8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only gains by 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.
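Pass@1 is used throughout the abstract without a definition. The sketch below shows the widely used unbiased pass@k estimator, which for k = 1 reduces to the fraction of sampled solutions that pass all tests. Whether the paper uses exactly this estimator, and how many samples it draws per problem, is not stated on this page, so the function names and the per-problem counts here are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of sampled solutions, c: number that pass all tests,
    k: sampling budget being evaluated.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical (n, c) counts for four problems, not taken from the paper.
results = [(10, 10), (10, 3), (10, 0), (10, 7)]
print(benchmark_pass_at_k(results, k=1))
```

With k = 1 each problem contributes c / n, so a benchmark score above 90% means almost every sampled solution passes, which leaves little room to separate strong models.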

Community

Comment from the paper author and submitter:

Static benchmarks are rapidly saturating as frontier LLMs improve. On LiveCodeBench, frontier models now exceed 99% Pass@1 on easy splits and over 90% on average, making it harder to distinguish strong coding models or obtain useful training signal.

BenchEvolver is a solution-centric evolutionary framework for turning solved coding problems into harder variants. Instead of generating new problem statements from scratch, it evolves reference solutions through structured transformations, then derives statements and tests from the evolved executable semantics.
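To make the idea of deriving tests from an evolved solution concrete, here is a minimal toy sketch. It assumes a seed task (maximum subarray sum) and a hypothetical "add a constraint" transformation (subarray length at most k); neither the task nor the transformation comes from the paper, and the actual BenchEvolver transformations are not described on this page. The point is only that once the evolved reference solution is executable, expected outputs can be produced by running it, and validity can be checked against an independent brute-force implementation.

```python
import random
from collections import deque


def seed_solution(nums: list[int]) -> int:
    """Seed task: maximum sum of a non-empty contiguous subarray (Kadane)."""
    best = cur = nums[0]
    for x in nums[1:]:
        cur = max(x, cur + x)
        best = max(best, cur)
    return best


def evolved_solution(nums: list[int], k: int) -> int:
    """Hypothetical evolved task: the subarray may have at most k elements.

    Uses prefix sums plus a monotonic deque holding candidate start indices.
    """
    prefix = [0]
    for x in nums:
        prefix.append(prefix[-1] + x)
    best = nums[0]
    window = deque()  # prefix indices whose prefix values are increasing
    for j in range(1, len(prefix)):
        i = j - 1
        while window and prefix[window[-1]] >= prefix[i]:
            window.pop()
        window.append(i)
        while window[0] < j - k:  # drop starts that exceed length k
            window.popleft()
        best = max(best, prefix[j] - prefix[window[0]])
    return best


def brute_force(nums: list[int], k: int) -> int:
    """Independent slow checker used to validate the evolved reference."""
    n = len(nums)
    return max(
        sum(nums[i:j])
        for i in range(n)
        for j in range(i + 1, min(n, i + k) + 1)
    )


def derive_tests(num_tests: int, seed: int = 0) -> list[dict]:
    """Generate inputs, then take expected outputs from the evolved solution."""
    rng = random.Random(seed)
    tests = []
    for _ in range(num_tests):
        nums = [rng.randint(-10, 10) for _ in range(rng.randint(1, 8))]
        k = rng.randint(1, len(nums))
        expected = evolved_solution(nums, k)
        assert expected == brute_force(nums, k)  # reference-correctness check
        tests.append({"input": {"nums": nums, "k": k}, "expected": expected})
    return tests


tests = derive_tests(50)
print(len(tests), tests[0])
```

In this toy setup the problem statement would still have to be written to match `evolved_solution`; the sketch covers only the test-derivation half of the step described above.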

Applied to LiveCodeBench and SciCode, BenchEvolver produces harder, valid, diverse tasks with verifiable correctness. We also curate LiveCodeBench-Plus, a 91-problem benchmark where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear model discrimination.
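One rough way to see why a 27.5% to 62.6% spread separates models more clearly than scores clustered near the ceiling is to express a score gap in units of its standard error. The sketch below treats each of the 91 problems as an independent pass/fail trial, which is a simplifying assumption and not the paper's analysis; the 97% and 99% "saturated" scores are hypothetical values chosen for illustration.

```python
from math import sqrt


def gap_in_standard_errors(p_a: float, p_b: float, n_problems: int) -> float:
    """Gap between two Pass@1 scores divided by the standard error of the gap.

    Assumes each problem is an independent Bernoulli trial for each model.
    """
    variance = p_a * (1 - p_a) / n_problems + p_b * (1 - p_b) / n_problems
    return abs(p_a - p_b) / sqrt(variance)


# Extremes of the reported LiveCodeBench-Plus range, 91 problems.
print(round(gap_in_standard_errors(0.275, 0.626, 91), 2))

# Hypothetical saturated scores on a benchmark of the same size.
print(round(gap_in_standard_errors(0.97, 0.99, 91), 2))
```

Under these assumptions the first gap is roughly five standard errors, while the second is about one, so the saturated pair is hard to tell apart from noise. Note that this compares the two ends of the reported range; models with adjacent scores would be separated by less.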

Beyond evaluation, evolved tasks remain challenging even for the model that generates them and can serve as reusable RL training signal. For gpt-oss-20b, seed+evolved training yields +8.7/+8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only improvements by 70.7% and 34.8%.
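The relative figures can be unpacked with a short calculation. Assuming "exceeding seed-only improvements by 70.7%" means the seed+evolved gain equals the seed-only gain times 1.707 (and likewise 1.348 for the second benchmark), the seed-only gains can be backed out as below. These implied values are derived under that reading and are not numbers reported on this page.

```python
def implied_seed_only_gain(combined_gain: float, relative_excess: float) -> float:
    """Solve combined_gain = seed_gain * (1 + relative_excess) for seed_gain."""
    return combined_gain / (1.0 + relative_excess)


reported = [
    ("LCB v6 Hard", 8.7, 0.707),
    ("LCB-Pro Easy", 8.3, 0.348),
]
for name, gain, excess in reported:
    seed_gain = implied_seed_only_gain(gain, excess)
    print(f"{name}: implied seed-only gain of about {seed_gain:.1f} points")
```

Under this reading, seed-only training would account for roughly 5.1 and 6.2 Pass@1 points respectively, with the evolved tasks contributing the remainder.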

Project: https://benchevolver.github.io/
Code: https://github.com/thu-wyz/BenchEvolver

Get this paper in your agent:

hf papers read 2606.01286
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
