Hugging Face Daily Papers · June 11, 2026 · 8 min read

Can Generalist Agents Automate Data Curation?

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2606.04261

Published on Jun 2 · Submitted by Adam Nguyen on Jun 11

Authors: Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia

Abstract

Automated data curation using generalist coding agents shows promise but requires structured scaffolding to achieve superior performance compared to traditional methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
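The propose, implement, evaluate, revise loop described in the abstract can be sketched as follows. This is a minimal illustration, not the Curation-Bench API: `run_curation_loop`, `propose`, `train_and_eval`, and the toy pool with its `quality` field are all hypothetical names, and the fake pipeline simply averages a number where a real run would fine-tune and evaluate a model.

```python
from typing import Callable, List, Sequence, Tuple

# A policy maps the data pool to the indices it selects (hypothetical type).
Policy = Callable[[Sequence[dict]], List[int]]
History = List[Tuple[str, float]]


def run_curation_loop(
    pool: Sequence[dict],
    propose: Callable[[History], Tuple[str, Policy]],
    train_and_eval: Callable[[Sequence[dict]], float],
    iterations: int = 10,
) -> Tuple[str, float, History]:
    """Propose a policy, score it with the fixed pipeline, record, repeat."""
    history: History = []
    best_name, best_score = "", float("-inf")
    for _ in range(iterations):
        name, policy = propose(history)       # the agent sees all prior feedback
        subset = [pool[i] for i in policy(pool)]
        score = train_and_eval(subset)        # fixed recipe and evaluation suite
        history.append((name, score))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score, history


# Toy stand-ins (assumptions for illustration only).
pool = [{"length": n, "quality": (n % 7) / 6} for n in range(100)]


def fake_train_and_eval(subset: Sequence[dict]) -> float:
    # Stands in for "fine-tune, then run the benchmark suite".
    return sum(ex["quality"] for ex in subset) / max(len(subset), 1)


def toy_propose(history: History) -> Tuple[str, Policy]:
    # Alternates between two fixed policies; a real agent would write new code.
    if len(history) % 2 == 0:
        return "first_k", lambda p: list(range(10))
    return "top_quality", lambda p: sorted(
        range(len(p)), key=lambda i: -p[i]["quality"])[:10]


best_name, best_score, history = run_curation_loop(
    pool, toy_propose, fake_train_and_eval, iterations=4)
```

The point of the structure is that only `propose` changes between agents and scaffolds; the pool, the training recipe, and the evaluation are held fixed so that scores are comparable across iterations.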


Community

adamtrnguyen

Paper author Paper submitter about 3 hours ago

Hi all. Quick summary of what we think is the interesting part:

Generalist coding agents (Claude Code, Codex, OpenHands with Kimi K2.5 / Qwen3.5-397B) can already run a full data-curation loop: inspect the pool, implement a selection policy, train, evaluate, revise. They match published data-selection baselines (ICONS, ARDS) within 10 iterations, recovering ~60% of the full-data fine-tuning gain from 1.5% of LLaVA-665K. The loop is not limited to instruction tuning: the same setup works for CLIP pretraining on DataComp-Small, where the agent clearly beats the strongest filtering baseline (top-30% CLIP L/14 score).
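To make the budgets above concrete, here is a small sketch of the arithmetic and of a score-threshold filter of the "top-30% by score" kind. The pool size of 665,000 is only the nominal figure implied by the dataset name, the scores are assumed to be precomputed per-example values (for instance image-text similarity), and both function names are hypothetical.

```python
from typing import List, Sequence


def budget_size(pool_size: int, fraction: float) -> int:
    """Number of examples a fractional budget allows, rounded to nearest."""
    return round(pool_size * fraction)


def top_fraction_by_score(scores: Sequence[float], fraction: float) -> List[int]:
    """Indices of the highest-scoring `fraction` of the pool, best first."""
    k = budget_size(len(scores), fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Nominal pool size assumed from the name "LLaVA-665K".
n_selected = budget_size(665_000, 0.015)

# Toy scores for ten examples; keep the top 30%.
toy_scores = [0.1, 0.9, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6, 0.5, 0.0]
kept = top_fraction_by_score(toy_scores, 0.30)
```

Under these assumptions a 1.5% budget is just under ten thousand examples, and the filter keeps the three toy examples with the highest scores.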

But trajectory analysis shows what we call the execution-research gap: agents grind local knobs (source ratios, length thresholds, random seeds) instead of exploring new method families. In a typical open-prompt run, only 2 of 10 iterations try something genuinely new. Strategy guides and paper references don't fix it. A scaffold that requires each iteration to cite, instantiate, and adapt a method from prior research does: the agent composed an EL2N-style top-loss + noise-filter policy, with no human design input, that beats published baselines even when those baselines are given 10x its data budget.
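For readers unfamiliar with the idea, the following is a rough sketch of what an "EL2N-style top-loss + noise-filter" selection could look like. The post does not give the agent's actual policy, so everything here is an assumption: EL2N is taken in its usual form (the L2 norm of predicted probabilities minus the one-hot label), the noise filter is modeled as discarding the very highest-scoring examples, and `select_hard_but_clean` and `noise_frac` are hypothetical names.

```python
import math
from typing import List, Sequence


def el2n_score(probs: Sequence[float], label: int) -> float:
    """L2 norm of (predicted probabilities - one-hot label)."""
    return math.sqrt(sum(
        (p - (1.0 if j == label else 0.0)) ** 2 for j, p in enumerate(probs)))


def select_hard_but_clean(scores: Sequence[float], budget: int,
                          noise_frac: float = 0.1) -> List[int]:
    """Drop the highest-scoring `noise_frac` of the pool as suspected noise,
    then take the `budget` highest-scoring examples among those that remain."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_drop = int(len(ranked) * noise_frac)
    return ranked[n_drop:n_drop + budget]


# Toy difficulty scores; index 9 is an outlier treated as likely noise.
toy_scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 1.4]
chosen = select_hard_but_clean(toy_scores, budget=3, noise_frac=0.1)
```

The design intuition is that high-error examples tend to be informative, while the extreme tail is more likely to be mislabeled or corrupted, so the selection takes a band just below the top rather than the top itself.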

One more finding that intrigues us: curation search itself scales. Extending the agent budget from 10 to 50 iterations keeps improving average outcomes with no clear plateau. Agent search iterations look like a meaningful compute axis for the finite-data regime.
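One way to read "average outcomes" at a given iteration budget is the best score found so far, averaged over independent runs. That reading is an assumption, as are the helper names and the toy numbers below; the sketch only shows how such a curve could be computed from per-iteration scores.

```python
from itertools import accumulate
from typing import List, Sequence


def best_so_far(scores: Sequence[float]) -> List[float]:
    """Running maximum of per-iteration scores for one agent run."""
    return list(accumulate(scores, max))


def mean_best_at_budget(runs: Sequence[Sequence[float]], budget: int) -> float:
    """Average over runs of the best score within the first `budget` iterations."""
    return sum(max(run[:budget]) for run in runs) / len(runs)


# Two toy runs of four iterations each (made-up numbers).
toy_runs = [[0.2, 0.5, 0.4, 0.7], [0.3, 0.3, 0.6, 0.6]]
curve = best_so_far(toy_runs[0])
short_budget = mean_best_at_budget(toy_runs, 2)
long_budget = mean_best_at_budget(toy_runs, 4)
```

A curve that keeps rising as `budget` grows is what "no clear plateau" would look like under this definition.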

Environment, trajectory diagnostics, and all scaffolds are open source: https://github.com/feiyang-k/curation-bench. Happy to answer questions.

librarian-bot

5 minutes ago

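This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes (2026), https://huggingface.co/papers/2605.05724
* MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills? (2026), https://huggingface.co/papers/2606.01993
* DataMaster: Data-Centric Autonomous AI Research (2026), https://huggingface.co/papers/2605.10906
* Kintsugi: Learning Policies by Repairing Executable Knowledge Bases (2026), https://huggingface.co/papers/2605.09487
* Exploring Autonomous Agentic Data Engineering for Model Specialization (2026), https://huggingface.co/papers/2605.30407
* Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents (2026), https://huggingface.co/papers/2605.10832
* SkillOpt: Executive Strategy for Self-Evolving Agent Skills (2026), https://huggingface.co/papers/2605.23904

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend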
