OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
Authors: Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao
Published: 2026-05-28
arXiv: 2605.23657
Project page: https://yingjiahao14.github.io/OpenSkillEval-Web/
Code: https://github.com/ALEX-nlp/OpenSkillEval
Abstract
AI-generated summary: OpenSkillEval is an automatic evaluation framework that assesses skill-augmented agent systems and skills across diverse real-world applications, revealing that skill availability doesn't guarantee effective usage and that performance benefits depend heavily on model and framework combinations.
Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.
Community
OpenSkillEval is an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. It first measures how well different models and agent frameworks handle real downstream tasks, with and without skill augmentation, and then runs controlled, same-task comparisons across community-contributed skills, logging quality alongside token and time cost. It spans five real-world application categories and 600+ tasks.
🌱 Why OpenSkillEval?
- An audit of the open skill ecosystem, not just a model leaderboard — we ask whether community-contributed skill packs actually move the needle on real agentic work.
- Five high-utility families that map to how people use agents today: data visualization, posters, slide decks, analytical reports, and web design.
- Controlled skill-vs-baseline + concrete takeaways for skill authors: every skill pack runs head-to-head against a no-skill baseline on the same case set / same judge / same model, surfacing which design patterns (format, structure, prior richness) translate to real gains and which only add cost.
- Joint quality + cost accounting: every run logs prompt / completion / cache tokens and wall-clock seconds, so you can read a skill's value against what it costs to invoke.
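The paired comparison and cost accounting described in the list above could be sketched roughly as follows. This is a minimal illustration, not OpenSkillEval's actual code or data schema: the `RunRecord` fields, the `compare_to_baseline` helper, the `slides-skill` name, and all numbers are hypothetical. It assumes each run yields one judge score plus token and wall-clock counts, and that skill runs and baseline runs share case IDs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    """One agent run. All field names are hypothetical, not the project's schema."""
    case_id: str
    skill: str              # "no-skill" marks the baseline run
    quality: float          # judge score, higher is better
    prompt_tokens: int
    completion_tokens: int
    cache_tokens: int       # kept separate: whether cached tokens are already
                            # counted inside prompt_tokens is provider-specific
    seconds: float          # wall-clock time

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens


def compare_to_baseline(runs, skill, baseline="no-skill"):
    """Pair runs by case_id and report mean deltas (skill minus baseline)."""
    base = {r.case_id: r for r in runs if r.skill == baseline}
    pairs = [(r, base[r.case_id]) for r in runs
             if r.skill == skill and r.case_id in base]
    if not pairs:
        raise ValueError(f"no shared cases between {skill!r} and {baseline!r}")
    return {
        "cases": len(pairs),
        "quality_delta": mean(s.quality - b.quality for s, b in pairs),
        "token_delta": mean(s.total_tokens - b.total_tokens for s, b in pairs),
        "seconds_delta": mean(s.seconds - b.seconds for s, b in pairs),
    }


# Made-up records for two cases, each run with and without a skill.
runs = [
    RunRecord("c1", "no-skill", 6.0, 1000, 500, 0, 30.0),
    RunRecord("c1", "slides-skill", 7.0, 1800, 700, 200, 45.0),
    RunRecord("c2", "no-skill", 8.0, 1200, 400, 0, 20.0),
    RunRecord("c2", "slides-skill", 7.5, 2000, 600, 300, 40.0),
]
report = compare_to_baseline(runs, "slides-skill")
print(report)
```

Pairing on case ID keeps the comparison within the same task, so a positive `quality_delta` can be read directly against the extra tokens and seconds the skill costs. A real audit would also hold the judge and the underlying model fixed across both arms, as the list above describes.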