Hugging Face Daily Papers · 3 min read

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.05806

Published on Jun 4 · Submitted by dongsheng zhu on Jun 8
Authors: Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin

Abstract

The ToolMaze benchmark shows that real-world tool failures significantly degrade Tool-Integrated Reasoning (TIR) performance. Implicit semantic failures cause the most severe drops, and dynamic replanning emerges as a key bottleneck.

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66 times slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
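
The two design axes in the abstract can be made concrete with a small toy. The sketch below is not the ToolMaze implementation and does not use its data or API. Every name in it (`Perturbation`, `perturb`, `find_path`, `run`, the tool names, the `transient_failures` knob) is hypothetical and chosen only for illustration. It assumes that an explicit failure surfaces as an exception, that an implicit failure returns a silently corrupted value, and that the tool graph is a DAG whose edges are tools.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional, Set, Tuple


class Visibility(Enum):
    EXPLICIT = "explicit"   # assumption: the failure is signalled by an exception
    IMPLICIT = "implicit"   # assumption: the call "succeeds" but the value is corrupted


class Persistence(Enum):
    TRANSIENT = "transient"  # fails for a limited number of calls, then recovers
    PERMANENT = "permanent"  # fails on every call


@dataclass
class Perturbation:
    visibility: Visibility
    persistence: Persistence
    transient_failures: int = 1  # hypothetical knob: failing calls before recovery


class ToolError(RuntimeError):
    """Raised by an explicitly failing tool, or when no path remains."""


Tool = Callable[[str], str]
Dag = Dict[str, List[Tuple[str, str]]]  # node -> [(tool name, next node)]


def perturb(tool: Tool, p: Perturbation, corrupt: Tool = lambda s: s) -> Tool:
    """Wrap a tool so that it fails according to one cell of the 2x2 taxonomy."""
    calls = 0

    def wrapped(x: str) -> str:
        nonlocal calls
        calls += 1
        failing = p.persistence is Persistence.PERMANENT or calls <= p.transient_failures
        if not failing:
            return tool(x)
        if p.visibility is Visibility.EXPLICIT:
            raise ToolError(f"call {calls} failed")
        return corrupt(tool(x))  # implicit: wrong answer, no error signal

    return wrapped


def find_path(dag: Dag, start: str, goal: str, banned: Set[str]) -> Optional[List[Tuple[str, str]]]:
    """Depth-first search for a tool sequence from start to goal that avoids banned tools."""
    if start == goal:
        return []
    for tool_name, nxt in dag.get(start, []):
        if tool_name in banned:
            continue
        rest = find_path(dag, nxt, goal, banned)
        if rest is not None:
            return [(tool_name, nxt)] + rest
    return None


def run(dag: Dag, tools: Dict[str, Tool], start: str, goal: str, value: str, max_retries: int = 1):
    """Retry a failing tool a bounded number of times, then ban it and replan."""
    banned: Set[str] = set()
    node, trace = start, []
    while node != goal:
        path = find_path(dag, node, goal, banned)
        if path is None:
            raise ToolError("no viable path left")
        tool_name, nxt = path[0]
        for _ in range(max_retries + 1):
            try:
                value = tools[tool_name](value)
            except ToolError:
                continue  # possibly transient: try the same tool again
            trace.append(tool_name)
            node = nxt
            break
        else:
            banned.add(tool_name)  # treated as permanent: replan around it
    return value, trace


dag: Dag = {
    "raw": [("parse_fast", "parsed"), ("parse_slow", "parsed")],
    "parsed": [("summarize", "done")],
}
tools: Dict[str, Tool] = {
    "parse_fast": perturb(str.strip, Perturbation(Visibility.EXPLICIT, Persistence.PERMANENT)),
    "parse_slow": str.strip,
    "summarize": perturb(str.upper, Perturbation(Visibility.EXPLICIT, Persistence.TRANSIENT)),
}
result, trace = run(dag, tools, "raw", "done", "  hello ")

# An implicit failure raises nothing, so the retry/replan loop above cannot see it.
silent = perturb(str.upper, Perturbation(Visibility.IMPLICIT, Persistence.PERMANENT),
                 corrupt=lambda s: s[::-1])
```

In this toy, the explicit cases are handled by bounded retries plus replanning over the DAG, while the implicit case passes through unnoticed unless the agent independently validates tool outputs. That is one way to read the abstract's point about over-trust in corrupted outputs, though the benchmark's actual perturbations and scoring may differ.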

Get this paper in your agent:

hf papers read 2606.05806
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
