Hugging Face Daily Papers · June 23, 2026 · 5 min read

PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Excited to share PlanBench-XL! We built this benchmark to evaluate whether LLM agents can really plan over long horizons in large, imperfect tool ecosystems, where they must iteratively retrieve tools, call them, and recover from missing, failing, or misleading tool access. The results show that even strong models still struggle a lot with adaptive recovery, especially when the only valid path becomes longer or less obvious. Happy to hear any thoughts, feedback, or suggestions!

arxiv:2606.22388

Published on Jun 21

Submitted by Jeff on Jun 23

#1 Paper of the day

University of Illinois at Urbana-Champaign

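Authors: Jiayu Liu, Qihan Lin, Cheng Qian, Rui Wang, Emre Can Acikgoz, Xiaocheng Yang, Jiateng Liu, Zhenhailong Wang, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür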

Abstract

PlanBench-XL evaluates large language model agents' ability to plan and adapt in complex tool-rich environments with limited visibility and dynamic disruptions.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools and invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: while GPT-5.4 achieves 51.90% accuracy in block-free settings, it collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.
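To make the setting in the abstract concrete, here is a minimal toy sketch of a retrieve-invoke-recover loop. It is not PlanBench-XL's code or API: the `Tool`, `retrieve`, and `Agent` names, the five retail-flavored tools, and the top-1 retrieval rule are all hypothetical, chosen only to illustrate limited tool visibility, an explicit failure, a silent failure, and a longer recovery path.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class Tool:
    name: str
    needs: FrozenSet[str]        # evidence keys required before the call
    gives: str                   # evidence key the tool claims to produce
    fn: Callable[[dict], dict]   # returns new evidence; may raise or return {}


def retrieve(registry: Dict[str, Tool], want: str,
             exclude: Set[str], k: int = 1) -> List[Tool]:
    """Toy retriever: the agent only sees the top-k tools claiming `want`."""
    hits = sorted(t.name for t in registry.values()
                  if t.gives == want and t.name not in exclude)
    return [registry[name] for name in hits[:k]]


@dataclass
class Agent:
    registry: Dict[str, Tool]
    max_calls: int = 10
    calls: List[str] = field(default_factory=list)
    abandoned: Set[str] = field(default_factory=set)

    def obtain(self, want: str, state: dict) -> bool:
        """Try to add `want` to `state`; assumes tool dependencies are acyclic."""
        while want not in state and len(self.calls) < self.max_calls:
            visible = retrieve(self.registry, want, self.abandoned)
            if not visible:
                return False                     # no path left for this sub-goal
            tool = visible[0]
            # Implicit sub-goals: gather the tool's prerequisites first.
            if not all(self.obtain(need, state) for need in sorted(tool.needs)):
                self.abandoned.add(tool.name)
                continue
            self.calls.append(tool.name)
            try:
                evidence = tool.fn(state)
            except RuntimeError:                 # explicit failure signal
                self.abandoned.add(tool.name)
                continue
            if want not in evidence:             # silent failure: no error, no evidence
                self.abandoned.add(tool.name)
                continue
            state.update(evidence)
        return want in state


def failing(_state: dict) -> dict:
    raise RuntimeError("tool unavailable")       # hypothetical "failing" block


# Hypothetical tools: two blocked refund tools and one longer valid path.
registry = {t.name: t for t in [
    Tool("get_order", frozenset(), "order",
         lambda s: {"order": {"id": s["order_id"]}}),
    Tool("refund_a_direct", frozenset({"order"}), "refund", failing),
    Tool("refund_b_silent", frozenset({"order"}), "refund", lambda s: {}),
    Tool("get_payment", frozenset({"order"}), "payment",
         lambda s: {"payment": "card"}),
    Tool("refund_c_via_payment", frozenset({"order", "payment"}), "refund",
         lambda s: {"refund": "approved"}),
]}

agent = Agent(registry)
state = {"order_id": 42}
solved = agent.obtain("refund", state)
print(solved, agent.calls)
```

In this toy run the agent reaches the goal only after abandoning both blocked refund tools and fetching an extra piece of evidence, so the recovery path takes five calls where an unblocked direct refund would take two. The silent tool is the harder case to notice, because the agent has to check the returned evidence itself rather than rely on an error.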

arXiv: https://arxiv.org/abs/2606.22388 · Project page: https://planbench-xl.github.io/ · GitHub: https://github.com/JiayuJeff/PlanBench-XL

Community

JiayuJeff

Paper author Paper submitter about 22 hours ago

DhavalPatel

about 12 hours ago

Interesting paper. Aspects of the industrial domain, such as "Data Exchange" and "Model Exchange", might have added another dimension. Here is a pointer to our repo:
https://github.com/IBM/AssetOpsBench

JiayuJeff

Paper author about 11 hours ago

Thanks! Great work!

Get this paper in your agent:

hf papers read 2606.22388

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
