Exploring Autonomous Agentic Data Engineering for Model Specialization
Authors: Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng
Organization: Zhejiang University
Published: 2026-05-28
arXiv: 2605.30407
AI-generated summary
Large language models can autonomously execute end-to-end data engineering pipelines for model specialization through iterative data adaptation and optimization.
Abstract
Large Language Models (LLMs) perform strongly on general tasks but often struggle to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods rely primarily on human-designed workflows, so it remains unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains: GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both the potential and the bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization. Code will be released at https://github.com/zjunlp/DataAgent.
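The abstract describes agents that plan, generate, and iteratively optimize training data, guided by post-training performance. The sketch below is one possible reading of that loop, not the authors' implementation: every name in it (`CurationState`, `run_data_engineering_loop`, `toy_propose`, `toy_train_and_evaluate`) is hypothetical, and the "training" step is a toy scoring function standing in for fine-tuning and evaluating a student model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CurationState:
    """Best dataset found so far and the score it achieved (hypothetical type)."""
    dataset: List[str] = field(default_factory=list)
    score: float = 0.0


def run_data_engineering_loop(
    propose: Callable[[List[str], float], List[str]],
    train_and_evaluate: Callable[[List[str]], float],
    rounds: int,
) -> Tuple[CurationState, List[float]]:
    """Greedy loop: propose a revised dataset, score it, keep it only if it improves."""
    best = CurationState(dataset=[], score=train_and_evaluate([]))
    history = [best.score]
    for _ in range(rounds):
        # The agent revises the current best dataset using the last score as feedback.
        candidate = propose(best.dataset, best.score)
        # Assumption: this stands in for fine-tuning a student model and evaluating it.
        score = train_and_evaluate(candidate)
        history.append(score)
        if score > best.score:
            best = CurationState(dataset=candidate, score=score)
    return best, history


def toy_propose(dataset: List[str], score: float) -> List[str]:
    """Toy agent: append one synthetic example per round."""
    return dataset + [f"example-{len(dataset)}"]


def toy_train_and_evaluate(dataset: List[str]) -> float:
    """Toy metric: improves with dataset size and saturates at three examples."""
    return min(len(dataset), 3) / 3.0


best, history = run_data_engineering_loop(
    toy_propose, toy_train_and_evaluate, rounds=5
)
print(len(best.dataset), best.score, len(history))
```

The accept-only-if-better rule is a deliberate simplification. A real agent would presumably also plan across domains and revise its strategy from evaluation feedback, which this toy does not attempt to model.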