Hugging Face Daily Papers · June 12, 2026 · 6 min read

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2606.12451

Published on Jun 4 · Submitted by Ashutosh Hathidara on Jun 12 · SAP

Authors: Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal

Abstract

Parametric tool retrieval models show reduced performance and understanding when evaluated with realistic ambiguous queries compared to standard benchmarks, revealing a dissociation between knowledge retrieval and true tool comprehension.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Embedding-based retrieval relies on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary and fine-tuning the model in two stages (memorization, then retrieval SFT) so that the LLM itself acts as the retriever; this achieves strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither setup reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, which is further evidence of this dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.
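To make the point about constrained decoding concrete, here is a minimal sketch of trie-constrained greedy selection. It is not the ToolSense or ToolBench implementation: the `TokenTrie` class, the `constrained_greedy_step` helper, and the token ids are all hypothetical, chosen only to show how restricting the candidate set to valid tool-token paths can produce a valid tool even when the model's unconstrained preference lies elsewhere.

```python
from typing import Dict, List, Sequence


class TokenTrie:
    """Prefix tree over the token-id sequences that name valid tools (hypothetical)."""

    def __init__(self, sequences: Sequence[Sequence[int]]) -> None:
        self.root: Dict[int, dict] = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix: Sequence[int]) -> List[int]:
        """Return the token ids that may follow `prefix` on some valid path."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []  # prefix has already left every valid path
            node = node[tok]
        return sorted(node.keys())


def constrained_greedy_step(scores: Dict[int, float], allowed: List[int]) -> int:
    """Pick the highest-scoring token among the allowed ones only."""
    if not allowed:
        raise ValueError("no valid continuation for this prefix")
    return max(allowed, key=lambda tok: scores.get(tok, float("-inf")))


# Hypothetical catalog: three tools, each named by a two-token path.
trie = TokenTrie([[5, 7], [5, 9], [6, 1]])

# Hypothetical model scores after emitting token 5. Token 3 scores highest,
# but it is not on any valid path, so the constraint never lets it win.
scores = {3: 9.0, 7: 1.0, 9: 2.0}
chosen = constrained_greedy_step(scores, trie.allowed_next([5]))
```

In this toy case the unconstrained argmax would be token 3, which is not a tool at all; the constraint silently replaces it with token 9. That is the sense in which constrained evaluation alone cannot show whether the model knows its tools.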

Community

ashutosh1919

Paper submitter about 8 hours ago

Parametric tool retrieval trains LLMs to act as their own retrievers by encoding tools as virtual tokens, achieving >90% recall on standard benchmarks. However, these benchmarks rely on verbose, fully-specified queries and constrained trie decoding, so it is impossible to tell whether the model truly understands its tools or is simply pattern-matching.
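For readers unfamiliar with the recall figure quoted above, the following is a minimal sketch of recall@k for a single query and its mean over a query set. The function names, the cutoff `k`, and the toy tool names are assumptions made for illustration; the post does not state which cutoff or averaging scheme the benchmarks use.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Iterable[str], k: int) -> float:
    """Fraction of the relevant tools that appear in the top-k retrieved tools."""
    relevant_set: Set[str] = set(relevant)
    if not relevant_set:
        raise ValueError("each query needs at least one relevant tool")
    hits = relevant_set & set(ranked[:k])
    return len(hits) / len(relevant_set)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Iterable[str]]], k: int) -> float:
    """Average recall@k over (ranked list, relevant set) pairs, one per query."""
    if not runs:
        raise ValueError("no queries to evaluate")
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Toy example with made-up tool names (not taken from ToolBench).
runs = [
    (["weather_api", "maps_api", "news_api"], {"weather_api", "stocks_api"}),
    (["news_api", "maps_api", "weather_api"], {"maps_api"}),
]
score = mean_recall_at_k(runs, k=2)
```

Here the first query recovers one of its two relevant tools in the top 2 and the second recovers its only relevant tool, so the mean is the average of 0.5 and 1.0.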

We introduce ToolSense, an open-source diagnostic framework that automatically generates three benchmarks from any tool catalog: a Realistic Retrieval Benchmark (RRB) with user-style queries at three ambiguity levels, an MCQ factual probe, and a QA inferential probe. Applying ToolSense to ToolBench (~47k tools) reveals a striking knowledge-retrieval dissociation: top parametric configurations collapse by 50–64 percentage points on realistic queries, falling below dense embedding baselines. Factual probing further shows that Stage 2 retrieval fine-tuning systematically erases the tool knowledge acquired during Stage 1 memorization. The best mitigation we found is combining LoRA with multi-format memorization.
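One way to read a multiple-choice probe score against chance is sketched below. This is an illustration under stated assumptions, not the paper's procedure: the number of answer options, the number of questions, and the normal-approximation threshold are all hypothetical, and `is_near_chance` is a helper invented for this example.

```python
import math


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on an MCQ item."""
    if num_options < 2:
        raise ValueError("an MCQ item needs at least two options")
    return 1.0 / num_options


def is_near_chance(accuracy: float, num_options: int, num_questions: int,
                   z: float = 1.96) -> bool:
    """True if `accuracy` lies within z standard errors of the chance rate.

    Uses the normal approximation to the binomial under the null hypothesis
    that the model guesses uniformly at random (an assumption of this sketch).
    """
    if num_questions <= 0:
        raise ValueError("num_questions must be positive")
    p = chance_accuracy(num_options)
    standard_error = math.sqrt(p * (1.0 - p) / num_questions)
    return abs(accuracy - p) <= z * standard_error


# Hypothetical numbers: 4 options per question, 400 questions.
guessing_like = is_near_chance(0.27, num_options=4, num_questions=400)
clearly_above = is_near_chance(0.60, num_options=4, num_questions=400)
```

With these made-up numbers, an accuracy of 0.27 is indistinguishable from guessing while 0.60 is not; a real analysis would use the probe's actual option count and sample size.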

Get this paper in your agent:

hf papers read 2606.12451

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

