When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
Authors: Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin
Organization: Baidu
Published: 2026-06-04 · arXiv: 2606.05806
Abstract
The ToolMaze benchmark shows that real-world tool failures significantly degrade TIR performance: implicit semantic failures cause the most severe drops, and dynamic replanning emerges as a key bottleneck.
Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2 × 2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66× slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
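To make the 2 × 2 perturbation taxonomy concrete, here is a minimal sketch of how the four failure modes could be injected into a tool. It is not taken from the ToolMaze code. The names `PerturbedTool`, `ToolError`, `Visibility`, `Persistence`, and `double` are hypothetical, and the readings of "explicit" (a raised error), "implicit" (a silently wrong value), "transient" (clears after a few calls), and "permanent" (never clears) are assumptions based only on the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Visibility(Enum):
    EXPLICIT = "explicit"   # assumed: the tool raises a visible error
    IMPLICIT = "implicit"   # assumed: the tool silently returns a wrong value


class Persistence(Enum):
    TRANSIENT = "transient"  # assumed: the failure clears after a few calls
    PERMANENT = "permanent"  # assumed: the failure never clears


class ToolError(RuntimeError):
    """Hypothetical error type used for explicit failures."""


@dataclass
class PerturbedTool:
    """Wraps a tool and injects one cell of the 2 x 2 taxonomy (illustrative)."""
    tool: Callable[[int], int]
    visibility: Visibility
    persistence: Persistence
    fail_calls: int = 1  # how many initial calls fail when transient
    calls: int = 0       # running count of invocations

    def __call__(self, x: int) -> int:
        self.calls += 1
        failing = (
            self.persistence is Persistence.PERMANENT
            or self.calls <= self.fail_calls
        )
        if not failing:
            return self.tool(x)
        if self.visibility is Visibility.EXPLICIT:
            raise ToolError("tool unavailable")
        # Implicit failure: return a corrupted value with no error signal.
        return self.tool(x) + 1


def double(x: int) -> int:
    """Stand-in tool for the example."""
    return 2 * x
```

Under these assumptions, a plain retry is enough to recover from an explicit transient failure, whereas an implicit permanent failure can only be handled if the agent validates the output and routes around the tool, which is consistent with the abstract's observation that over-trust in corrupted outputs drives the sharpest drops.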