Hugging Face Daily Papers · June 2, 2026 · 4 min read

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2605.28556

Published on May 27, 2026 · Submitted by Tomer Keren on Jun 2, 2026 · #3 Paper of the day

Technion Israel Institute of Technology

Authors: Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz, Michal Shmueli-Scheuer, Roi Reichert

Code: https://github.com/tomerkeren42/TASTE-task-synthesis-from-tool-sequence-evolution

Abstract

AI-generated summary: An automated benchmark-generation method creates challenging tasks with broader tool-use coverage by evolving tool sequences through adaptive contrastive n-gram modeling and iterative difficulty refinement.

As agent capabilities advance, existing benchmarks, such as τ^2-Bench, are becoming increasingly saturated. Yet constructing new benchmark tasks remains complex, costly, and labor-intensive. Moreover, the standard approach, in which scenarios are first written in natural language and then mapped to tool sequences, captures only a narrow subset of the tool-use patterns agents exercise. In this paper, we address these problems by reversing the task construction process. We propose TASTE: Task Synthesis from Tool Sequence Evolution, an automatic method that generates challenging tasks with broader tool-use coverage. TASTE utilizes an Adaptive Contrastive n-gram model trained on LLM-judged validity signals. This enables sampling valid tool sequences that cover a vast range of tool combinations. TASTE then selects representative sequences from the pool via clustering, instantiates them into complete benchmark tasks, and refines them through iterative difficulty evolution. Using TASTE, we construct τ^c-Bench, a challenging extension of the three domains of τ^2-Bench. We evaluate 11 agent/user LLM pairs and find that models nearly saturating τ^2-Bench suffer severe performance drops on our tasks (e.g., Gemini-3-Flash falls from 0.82–0.94 to 0.28–0.61). Beyond increasing difficulty, our generated tasks more than double the number of unique tool combinations agents must execute. Our results suggest high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. By automating the generation of difficult, high-coverage benchmarks, TASTE enables continuous, scalable evaluation of future agents.
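The abstract does not spell out how the Adaptive Contrastive n-gram model works, so the following is only a minimal sketch of the general idea under stated assumptions: a bigram model that scores each tool-to-tool transition by the smoothed ratio of its counts in judge-approved versus judge-rejected sequences, samples new sequences from those scores, and folds newly judged sequences back into the counts. The class name, the `update` method, the smoothing constant, and the tool names are all hypothetical and are not taken from the paper or its repository.

```python
import math
import random
from collections import Counter

START, END = "<s>", "</s>"  # boundary markers (an assumption of this sketch)


def bigram_counts(sequences):
    """Count adjacent tool pairs, including the start/end markers."""
    counts = Counter()
    for seq in sequences:
        padded = [START, *seq, END]
        counts.update(zip(padded, padded[1:]))
    return counts


class ContrastiveBigram:
    """Hypothetical contrastive bigram model over tool names."""

    def __init__(self, valid, invalid, tools, alpha=1.0):
        self.tools = list(tools)
        self.alpha = alpha                  # add-alpha smoothing
        self.pos = bigram_counts(valid)     # transitions in judge-approved sequences
        self.neg = bigram_counts(invalid)   # transitions in judge-rejected sequences

    def score(self, prev, nxt):
        """Smoothed log-ratio of valid to invalid counts for one transition."""
        p = self.pos[(prev, nxt)] + self.alpha
        n = self.neg[(prev, nxt)] + self.alpha
        return math.log(p / n)

    def update(self, seq, is_valid):
        """Fold one newly judged sequence back into the counts ('adaptive' step)."""
        target = self.pos if is_valid else self.neg
        target.update(bigram_counts([seq]))

    def sample(self, rng, max_len=6):
        """Sample a non-empty tool sequence, weighting transitions by their ratio."""
        seq, prev = [], START
        while len(seq) < max_len:
            # Do not allow stopping before at least one tool has been emitted.
            options = self.tools + ([END] if seq else [])
            weights = [math.exp(self.score(prev, o)) for o in options]
            nxt = rng.choices(options, weights=weights, k=1)[0]
            if nxt == END:
                break
            seq.append(nxt)
            prev = nxt
        return seq


# Hypothetical tool names and judge labels, for illustration only.
TOOLS = ["get_user", "get_order", "cancel_order", "refund"]
model = ContrastiveBigram(
    valid=[["get_user", "get_order", "cancel_order"]],
    invalid=[["cancel_order", "get_user"]],
    tools=TOOLS,
)
candidate = model.sample(random.Random(0))
model.update(candidate, is_valid=True)  # in practice the label would come from an LLM judge
```

In this sketch a transition seen only in approved sequences gets a positive score and one seen only in rejected sequences gets a negative score, so sampling drifts toward plausible orderings while smoothing keeps unseen transitions reachable. The clustering, task instantiation, and difficulty-evolution stages described in the abstract are not covered here.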


Community

tomer-keren (paper author, paper submitter) · about 3 hours ago

TASTE is a new way to automatically create diverse, harder, and verified benchmarks for tool-using AI agents.
Instead of writing tasks first, we start from the tool sequences agents need to execute, then synthesize realistic tasks around them.
The result: models that look strong on existing benchmarks face a much tougher and broader test.
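The "broader test" claim rests on how many distinct tool combinations a benchmark exercises. As a rough illustration, the sketch below counts combinations under the assumption that a combination is the unordered set of tools used in one task; the paper may define it differently, and the function names and sample trajectories are hypothetical.

```python
def tool_combinations(trajectories):
    """Return the distinct tool combinations, ignoring order and repeated calls."""
    return {frozenset(t) for t in trajectories}


def coverage_ratio(baseline, generated):
    """Ratio of distinct combinations in the generated set to the baseline set."""
    return len(tool_combinations(generated)) / len(tool_combinations(baseline))


# Hypothetical trajectories, for illustration only.
baseline = [
    ["get_user", "get_order"],
    ["get_order", "get_user"],   # same combination as above, different order
    ["get_user"],
]
generated = [
    ["get_user"],
    ["get_user", "get_order"],
    ["get_user", "get_order", "cancel_order"],
    ["get_order", "refund"],
    ["get_user", "refund"],
]
print(coverage_ratio(baseline, generated))  # 2.5
```

A ratio above 2 on real task sets would correspond to the "more than double" figure quoted in the abstract, provided the same definition of a combination is used for both benchmarks.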


