Hugging Face Daily Papers · 6 min read

An Empirical Study of Automating Agent Evaluation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.11378

An Empirical Study of Automating Agent Evaluation

Published on May 12 · Submitted by Sangmin Woo on May 14
Authors: Kang Zhou, Sangmin Woo, Haibo Ding, Kiran Ramnath, Subramanian Chidambaram, Aosong Feng, Vinayak Arannil, Muhyun Kim, Ishan Singh, Darren Wang, Zhichao Xu, Megha Gandhi, Nirmal Prabhu, Soumya Smruti Mishra, Vivek Singh, Gouri Pandeshwar, Lin Lee Cheong

Abstract

Automated agent evaluation using AI assistants requires specialized domain knowledge and procedural skills to achieve reliable results, as demonstrated by the EvalAgent system that improves evaluation accuracy through structured evaluation skills and a meta-evaluation framework.

AI-generated summary

Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
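The Eval@1 metric described above can be read as the fraction of agents whose generated evaluation code both executes and yields meaningful results on the first run, with no retries. A minimal sketch of that reading in Python (the `EvalAttempt` structure, its field names, and the helper function are illustrative assumptions, not the paper's implementation):

```python
from dataclasses import dataclass

@dataclass
class EvalAttempt:
    """Outcome of the first (and only) run of one agent's generated evaluation."""
    executed: bool    # did the evaluation code run to completion without errors?
    meaningful: bool  # did it produce interpretable metric values?

def eval_at_1(attempts: list[EvalAttempt]) -> float:
    """Fraction of agents whose generated evaluation both executes and
    yields meaningful results on the first run."""
    if not attempts:
        return 0.0
    passed = sum(1 for a in attempts if a.executed and a.meaningful)
    return passed / len(attempts)

# Illustrative only: 20 agents, 13 first-run successes.
attempts = (
    [EvalAttempt(executed=True, meaningful=True)] * 13
    + [EvalAttempt(executed=True, meaningful=False)] * 4
    + [EvalAttempt(executed=False, meaningful=False)] * 3
)
print(eval_at_1(attempts))  # → 0.65
```

Under this reading, 13 first-run successes across the benchmark's 20 agents would correspond to the reported Eval@1 of 65%; code that merely executes but produces uninterpretable output does not count.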

Community

Paper submitter about 5 hours ago

Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.


Models citing this paper 0

No model linking this paper


Datasets citing this paper 0

No dataset linking this paper


Spaces citing this paper 0

No Space linking this paper


Collections including this paper 0

No Collection including this paper


