Hugging Face Daily Papers · June 8, 2026 · 3 min read

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.05806


Published on Jun 4 · Submitted by dongsheng zhu on Jun 8 · BAIDU


Authors: Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin

Abstract

ToolMaze benchmark reveals that real-world tool failures significantly degrade TIR performance, with implicit semantic failures causing the most severe drops and dynamic replanning emerging as a key bottleneck.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66× slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
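
To make the 2×2 perturbation taxonomy concrete, here is a minimal Python sketch of how the four cells (explicit/implicit crossed with transient/permanent) might be injected into a tool call. It is an illustration only and is not taken from the ToolMaze code: the names `Visibility`, `Persistence`, `PerturbedTool`, `ToolError`, `failing_calls`, and the toy `double` tool are all hypothetical, and modelling an implicit failure as a negated return value is an assumption made for brevity.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import Callable


class Visibility(Enum):
    EXPLICIT = "explicit"  # the failure is signalled to the caller
    IMPLICIT = "implicit"  # the call appears to succeed but the output is wrong


class Persistence(Enum):
    TRANSIENT = "transient"  # the fault clears after a few calls
    PERMANENT = "permanent"  # the fault never clears


# The four cells of the 2x2 taxonomy.
PERTURBATION_TYPES = list(product(Visibility, Persistence))


class ToolError(RuntimeError):
    """Raised when a perturbed tool fails explicitly (hypothetical)."""


@dataclass
class PerturbedTool:
    """Hypothetical wrapper that injects one perturbation type into a tool."""

    tool: Callable[[int], int]
    visibility: Visibility
    persistence: Persistence
    failing_calls: int = 1  # assumption: transient faults last this many calls
    calls: int = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        faulty = (
            self.persistence is Persistence.PERMANENT
            or self.calls <= self.failing_calls
        )
        if not faulty:
            return self.tool(x)
        if self.visibility is Visibility.EXPLICIT:
            raise ToolError("tool unavailable")
        # Assumption: a silently wrong value stands in for a corrupted output.
        return -self.tool(x)


def double(x: int) -> int:
    """Toy tool used only for this illustration."""
    return 2 * x
```

In this sketch an explicit fault raises an error the agent can react to, while an implicit fault returns a plausible-looking value that is only caught if the agent checks the output. A transient fault rewards a retry, whereas a permanent one can only be handled by switching to a different tool or path.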