Hugging Face Daily Papers · 5 min read

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.09252


Published on May 10 · Submitted by Chung-En Sun on May 13
Authors: Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng

AI-generated summary

When2Tool is a benchmark that identifies when tool calls are necessary for LLM agents. It reveals that models can predict tool necessity from their hidden states yet fail to act on that knowledge, motivating Probe&Prefill, a method that reduces unnecessary calls by 48% with minimal accuracy loss.

Abstract

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity (computational scale, knowledge boundaries, and execution reliability), each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89–0.96 across six models, substantially exceeding the models' own verbalized reasoning. This reveals that models already know when tools are needed but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only a 1.7% accuracy loss, while the best baseline at comparable accuracy reduces tool calls by only 6%, or achieves a similar reduction but incurs a 5× higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
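The mechanism the abstract describes can be illustrated with a small sketch: read the hidden state of the last prompt token before any generation, score tool necessity with a lightweight linear probe, and prefill a steering sentence into the assistant turn when the probe says no tool is needed. The model name, probe layer, threshold, and steering sentence below are illustrative assumptions, not the paper's actual configuration; see the linked repository for the authors' implementation.

# Minimal Probe&Prefill-style sketch (illustrative assumptions, not the paper's exact code).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # assumed model; any chat model works similarly
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto")

@torch.no_grad()
def pre_generation_state(prompt: str) -> torch.Tensor:
    # Hidden state of the last prompt token, taken before any tokens are generated.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].float().cpu()  # last layer, last token (assumed layer choice)

def train_probe(prompts, labels):
    # labels: 1 = a tool call is genuinely needed, 0 = the model can answer directly.
    X = torch.stack([pre_generation_state(p) for p in prompts]).numpy()
    return LogisticRegression(max_iter=1000).fit(X, labels)

STEER = "I can answer this directly, without calling any tool.\n"  # assumed wording

@torch.no_grad()
def answer(probe, prompt: str, threshold: float = 0.5) -> str:
    # Read the probe; when it predicts "no tool needed", prefill the steering sentence.
    x = pre_generation_state(prompt).numpy().reshape(1, -1)
    p_tool = probe.predict_proba(x)[0, 1]
    prefill = "" if p_tool >= threshold else STEER
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False, add_generation_prompt=True,
    ) + prefill  # the prefilled sentence starts the assistant turn
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

Evaluating such a probe with AUROC on held-out tool-necessary and tool-unnecessary prompts mirrors the 0.89–0.96 decodability result reported above, and the prefilled sentence is what steers generation away from unnecessary tool calls.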

Community

Paper submitter · about 2 hours ago

LLM agents often call tools even when they do not need to. Our paper introduces When2Tool, a benchmark for tool-necessity decisions, and shows that models’ hidden states already know when tools are needed better than their verbal reasoning does. We use this signal in Probe&Prefill, reducing tool calls by 48% with only 1.7% accuracy loss, and reducing real-world Search-o1 API calls by 20–56% with no accuracy drop.


Get this paper in your agent:

hf papers read 2605.09252
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0
Datasets citing this paper: 1
Spaces citing this paper: 0
Collections including this paper: 0

