r/MachineLearning

Tool selection at scale is a retrieval problem, and document-style defaults are the wrong starting point [D]


A pattern I keep running into building agents. Posting as a discussion because I think the standard intuition is backwards for this specific case.

Setup is an agent with a big set of callable tools (mine are MCP-exposed, but the shape generalises to any function-calling loop). You can't put all of them in front of the model every turn. Past a certain catalog size, selection accuracy drops, and just carrying every definition becomes the dominant token cost of the loop before any actual work happens. So you retrieve a relevant subset per request, which makes tool selection a retrieval problem.

The instinct from document RAG is semantic embeddings: embed the query, embed each tool description, rank by cosine similarity. I assumed this going in, and for tool selection it lost to a plain lexical baseline in my evals.

It's the shape of the data. Tool descriptions are short, structurally similar (verb-noun, a parameter list), and the thing that actually discriminates is often a single token, some property name like repo_id or channel. Cosine similarity over short, near-identical strings smears that. "list the open issues" and "list the channel messages" land close together because they share most of their tokens, and the noun that decides the right tool gets diluted. BM25 over a flat-text projection of name, description, and a walk of the input and output schema keeps that discriminator sharp, and as a bonus it needs no embedding model and runs fully in-process.
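To make the flat-text projection concrete, here is a minimal sketch in plain Python. It is not the code behind the benchmark. The tool dict layout (name, description, input_schema, output_schema), the helper names, the tokenizer, and the BM25 parameters k1=1.5 and b=0.75 are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumerics. Underscores are kept so that
    # property names such as repo_id survive as single tokens.
    return re.findall(r"[a-z0-9_]+", text.lower())


def walk_schema(schema):
    # Recursively collect property names and description strings from a
    # JSON-Schema-like dict (assumed layout, not a full JSON Schema walker).
    parts = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key == "properties" and isinstance(value, dict):
                for prop_name, prop_schema in value.items():
                    parts.append(prop_name)
                    parts.extend(walk_schema(prop_schema))
            elif key == "description" and isinstance(value, str):
                parts.append(value)
            else:
                parts.extend(walk_schema(value))
    elif isinstance(schema, list):
        for item in schema:
            parts.extend(walk_schema(item))
    return parts


def flatten_tool(tool):
    # Flat-text projection: name, description, then input and output schema.
    parts = [tool["name"], tool.get("description", "")]
    parts += walk_schema(tool.get("input_schema", {}))
    parts += walk_schema(tool.get("output_schema", {}))
    return " ".join(parts)


class BM25Index:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(tokenize(d)) for d in docs]
        self.doc_lens = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.doc_lens) / max(len(docs), 1)
        doc_freq = Counter()
        for tf in self.term_freqs:
            doc_freq.update(tf.keys())
        n = len(docs)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {t: math.log(1 + (n - d + 0.5) / (d + 0.5))
                    for t, d in doc_freq.items()}

    def score(self, query, i):
        tf, dl = self.term_freqs[i], self.doc_lens[i]
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=5):
        order = sorted(range(len(self.term_freqs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


# Hypothetical three-tool catalog, for illustration only.
TOOLS = [
    {"name": "list_issues", "description": "List the open issues in a repository",
     "input_schema": {"type": "object", "properties": {
         "repo_id": {"type": "string"}, "state": {"type": "string"}}}},
    {"name": "list_messages", "description": "List the messages in a channel",
     "input_schema": {"type": "object", "properties": {
         "channel": {"type": "string"}, "limit": {"type": "integer"}}}},
    {"name": "create_issue", "description": "Create a new issue in a repository",
     "input_schema": {"type": "object", "properties": {
         "repo_id": {"type": "string"}, "title": {"type": "string"}}}},
]

index = BM25Index([flatten_tool(t) for t in TOOLS])
best = TOOLS[index.top_k("list the channel messages", k=1)[0]]["name"]
```

In this toy catalog the two "list" tools share the tokens "list" and "the", and the query is separated only by "channel" and "messages", which appear in just one projection. A real catalog would likely need more careful tokenization, for example also splitting snake_case names into their parts.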

Underneath, it's just that tools live in a smaller, more structured space than documents do. The signal is keyword-shaped, which is exactly what BM25 is for. The document-RAG default (semantic primary, hybrid rerank) assumes paragraph-length, semantically rich chunks. Tool catalogs are the opposite, so the default transfers badly.

Not saying semantic is useless here. Above some catalog size, or for fuzzy intent, a semantic or hybrid layer probably earns its place, and that's where I'd expect the frontier to move. But the right primary for tool selection today, in my testing, is lexical, with semantic as the optional add rather than the default. There's an open benchmark that scores this over a 43,000-tool corpus with labeled relevance, comparable to the ToolRet leaderboard, if anyone wants to reproduce it or argue with it: https://github.com/ratel-ai/ratel
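For anyone comparing retrievers against labeled relevance, a minimal sketch of one common way to score them is recall@k. This is an assumption for illustration: the post does not say which metric the linked benchmark reports, and the function names and data layout here are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of the labeled-relevant tools that appear in the top k results.
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs, k):
    # runs: list of (ranked_ids, relevant_ids) pairs, one per query.
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical results for two queries from some retriever.
runs = [
    (["list_messages", "list_issues"], {"list_messages"}),  # hit at rank 1
    (["create_issue", "list_issues"], {"list_issues"}),     # hit at rank 2
]
r1 = mean_recall_at_k(runs, k=1)
r2 = mean_recall_at_k(runs, k=2)
```

Running the same query set through a lexical and a semantic retriever and comparing the two numbers at a fixed k is the kind of comparison the claim above rests on.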

Would like to hear from anyone who's measured semantic beating lexical on tool selection at scale, because the BM25-at-200-plus question is where I'm least sure of my own result.

submitted by /u/AbjectBug5885