r/MachineLearning · 3 min read

Why I stopped using semantic embeddings for tool selection and switched back to BM25 [D]


I've been building agents for about a year and recently shipped one for a client running ~140 MCP-exposed tools at peak. Along the way I made the canonical mistake. I used cosine similarity over tool description embeddings to pick which tools the model could see per turn. Worked great in demos. Was actively dangerous in production.

Here's the problem. In a basic semantic-ranking setup you embed the user query, embed every tool description once, and rank by cosine similarity at runtime. That works for general document retrieval where chunks are paragraph-length, semantically rich, and roughly equal in form.
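A minimal sketch of that setup, for concreteness. The tool names and the 3-dimensional vectors are toy stand-ins I made up for illustration; a real system would get its vectors from an embedding model, which is not called here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_tools(
    query_vec: Sequence[float], tool_vecs: Dict[str, Sequence[float]]
) -> List[Tuple[str, float]]:
    """Return (tool name, similarity) pairs, most similar first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in tool_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption): axis 0 ~ "read", axis 1 ~ "file", axis 2 ~ "commits".
# Tool vectors are computed once, up front; only the query is embedded per turn.
tool_vecs = {
    "read_file": [0.9, 0.1, 0.0],
    "read_channel_messages": [0.8, 0.0, 0.2],
    "list_commits": [0.3, 0.0, 0.9],
}

# A query vector dominated by the "read" axis, with only a weak "commits" component.
ranking = rank_tools([0.9, 0.05, 0.1], tool_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy numbers the two "read" tools land well ahead of list_commits, because the shared verb axis dominates the dot product.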

Tool descriptions are not that. They are short (often <50 tokens), structurally similar (verb-noun, parameters list), and the discriminative information is often a single keyword. "Read a file from disk" and "Read messages from a channel" both embed close to "read" + "file/channel." For a query like "read the latest commits," cosine similarity puts both descriptions next to the query because all three texts share the verb's region of embedding space, and the actual discriminator (the noun "commits") gets diluted.

I watched this happen in eval. I asked the agent "list the open issues for this repo." The semantic ranker returned slack_search_messages first because its description contained close embedding neighbors of "list", "open", and "issues". The actual github_list_issues tool ranked 4th because the GitHub MCP author wrote a terse "Lists issues in a repository" description that scored lower on every soft keyword.

If the model sees slack_search_messages first and github_list_issues fourth, it's going to pick the wrong one. Often.

So I built three retrieval strategies and tested them on a fixed corpus of 200 query→correct-tool pairs.
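For reference, top-1 accuracy over query→correct-tool pairs can be computed roughly like this. The ranker and the four labeled pairs below are hypothetical placeholders, not the 200-pair corpus from the post.

```python
from typing import Callable, Iterable, List, Tuple

# A ranker maps a query to tool names, best candidate first.
Ranker = Callable[[str], List[str]]


def top1_accuracy(ranker: Ranker, pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of queries whose first-ranked tool equals the labeled tool."""
    labeled = list(pairs)
    if not labeled:
        raise ValueError("need at least one labeled pair")
    hits = 0
    for query, expected_tool in labeled:
        ranked = ranker(query)
        if ranked and ranked[0] == expected_tool:
            hits += 1
    return hits / len(labeled)


def keyword_ranker(query: str) -> List[str]:
    """Hypothetical stand-in ranker: GitHub tool first only if 'issues' appears."""
    if "issues" in query.lower():
        return ["github_list_issues", "slack_search_messages"]
    return ["slack_search_messages", "github_list_issues"]


# Hypothetical labeled pairs, for illustration only.
pairs = [
    ("list the open issues for this repo", "github_list_issues"),
    ("find messages about the outage", "slack_search_messages"),
    ("show issues assigned to me", "github_list_issues"),
    ("which bugs are still unresolved", "github_list_issues"),  # the ranker misses this one
]

accuracy = top1_accuracy(keyword_ranker, pairs)
print(f"top-1 accuracy: {accuracy:.2f}")
```

Keeping the corpus fixed and swapping only the ranker is what makes the three numbers comparable.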

Semantic embeddings (text-embedding-3-small): 64% top-1 accuracy. Sneaky failure mode: when wrong, it was confidently wrong, often with a totally unrelated tool ranked first.

BM25 over a flat-text projection of tool name + description + schema walk: 81% top-1. Failures were almost always lexical (the tool used "fetch" while the user said "get"), recoverable with light query rewriting.
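A from-scratch sketch of BM25 over flat tool text, to show the shape of it. Assumptions: the common Okapi formulation with k1=1.5 and b=0.75, a smoothed IDF, and a simple lowercase alphanumeric tokenizer that also splits snake_case names. The flat strings are hand-written illustrations (the Slack wording is hypothetical), and none of this is claimed to match the exact setup used in the eval.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    def __init__(self, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
        self.lengths = {name: sum(c.values()) for name, c in self.term_counts.items()}
        # Fall back to 1.0 so an empty corpus never divides by zero.
        self.avg_len = (sum(self.lengths.values()) / len(docs)) if docs else 1.0
        self.avg_len = self.avg_len or 1.0
        # Document frequency: in how many tool texts each term appears.
        self.doc_freq: Counter = Counter()
        for counts in self.term_counts.values():
            self.doc_freq.update(counts.keys())

    def idf(self, term: str) -> float:
        n = self.doc_freq.get(term, 0)
        total = len(self.term_counts)
        return math.log(1 + (total - n + 0.5) / (n + 0.5))

    def score(self, query: str, name: str) -> float:
        counts = self.term_counts[name]
        length_norm = 1 - self.b + self.b * self.lengths[name] / self.avg_len
        total = 0.0
        for term in tokenize(query):
            tf = counts.get(term, 0)
            if tf:
                total += self.idf(term) * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return total

    def rank(self, query: str) -> List[Tuple[str, float]]:
        scored = [(name, self.score(query, name)) for name in self.term_counts]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Flat text = tool name + description + schema property names (illustrative).
tools = {
    "github_list_issues": "github_list_issues Lists issues in a repository repo_id state",
    "slack_search_messages": "slack_search_messages Search messages in a channel channel_id query",
    "read_file": "read_file Read a file from disk path",
}
index = BM25Index(tools)
ranked = index.rank("list the open issues for this repo")
print(ranked[0][0])
```

Note that "repo" only matches here because repo_id was included in the flat text and the tokenizer splits on the underscore.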

Hybrid (0.7 semantic + 0.3 BM25 normalized): 78%. Worse than BM25 alone. The semantic noise dragged BM25's clean signal down.
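A sketch of that kind of weighted fusion. The post does not say how the scores were normalized, so min-max scaling is an assumption here, and the raw scores below are made-up numbers chosen to illustrate the failure mode, not values from the eval.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores into [0, 1]; if all scores are equal, map everything to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


def hybrid_rank(
    semantic: Dict[str, float],
    bm25: Dict[str, float],
    w_semantic: float = 0.7,
    w_bm25: float = 0.3,
) -> List[Tuple[str, float]]:
    """Weighted sum of min-max normalized scores, best first (ties broken by name)."""
    sem_n, bm_n = min_max(semantic), min_max(bm25)
    names = set(sem_n) | set(bm_n)
    fused = {
        name: w_semantic * sem_n.get(name, 0.0) + w_bm25 * bm_n.get(name, 0.0)
        for name in names
    }
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))


# Made-up scores: the semantic ranker prefers the wrong tool, BM25 the right one.
semantic_scores = {"slack_search_messages": 0.62, "github_list_issues": 0.40, "read_file": 0.30}
bm25_scores = {"slack_search_messages": 0.0, "github_list_issues": 4.1, "read_file": 0.0}

fused_ranking = hybrid_rank(semantic_scores, bm25_scores)
print(fused_ranking[0][0])
```

With a 0.7 weight on the semantic side, a confident semantic miss can outvote a clean BM25 hit, which is the kind of drag described above.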

I sat with that result for a while. The "obvious" answer is hybrid; every RAG paper since 2023 says hybrid wins. For tool selection specifically, hybrid lost. The reason is that tools live in a smaller, more structured space than documents do. The discriminative signal is keyword-shaped. BM25 is built for exactly that.

The other thing I learned: indexing schema fields matters. The clean BM25 win came from projecting name + description + a walk over input_schema and output_schema (semantic tokens only, JSON Schema structure stripped). Property names like repo_id or branch are exactly the discriminators that turn "list the open issues" into a hit on GitHub instead of Slack. If you only index name + description you leave half your signal on the floor.
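Roughly what a schema walk like that could look like. This is my own minimal guess at the idea, not Ratel's projection from ADR-0004: it keeps property names, titles, descriptions, and string enum values, and drops structural keywords such as type and required. The example tool's schema is hypothetical.

```python
from typing import Any, Dict, Iterator


def walk_schema(schema: Any) -> Iterator[str]:
    """Yield human-meaningful strings from a JSON Schema, skipping structure."""
    if isinstance(schema, list):
        for sub in schema:
            yield from walk_schema(sub)
        return
    if not isinstance(schema, dict):
        return  # None, booleans, and other non-schema values contribute nothing
    for key in ("title", "description"):
        value = schema.get(key)
        if isinstance(value, str):
            yield value
    for value in schema.get("enum", []):
        if isinstance(value, str):
            yield value
    properties = schema.get("properties")
    if isinstance(properties, dict):
        for prop_name, sub in properties.items():
            yield prop_name  # names like repo_id are the discriminators
            yield from walk_schema(sub)
    # Recurse into nested subschemas.
    for key in ("items", "additionalProperties", "anyOf", "oneOf", "allOf"):
        yield from walk_schema(schema.get(key))


def project_tool(tool: Dict[str, Any]) -> str:
    """Flatten name + description + both schemas into one indexable string."""
    parts = [tool.get("name", ""), tool.get("description", "")]
    parts.extend(walk_schema(tool.get("input_schema")))
    parts.extend(walk_schema(tool.get("output_schema")))
    return " ".join(part for part in parts if part)


# Hypothetical tool definition for illustration.
tool = {
    "name": "github_list_issues",
    "description": "Lists issues in a repository",
    "input_schema": {
        "type": "object",
        "properties": {
            "repo_id": {"type": "string", "description": "Repository identifier"},
            "state": {"type": "string", "enum": ["open", "closed"]},
        },
        "required": ["repo_id"],
    },
}

flat = project_tool(tool)
print(flat)
```

The enum value "open" ending up in the flat text is the sort of thing that lets a query like "list the open issues" match on more than the description alone.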

I ended up adopting Ratel's indexing approach (their ADR-0004 documents the exact projection) because rebuilding it myself was redundant. Open source, in-process Rust, NAPI-RS bound to a TS SDK, no infra. The semantic + re-ranking story is on their roadmap, but for now the BM25-only default is what I want anyway. Happy to share it in the comments if anyone wants to try.

The takeaway for anyone building tool selection or agent gateways: do not assume document-RAG defaults transfer. Tools are a different shape of data. BM25 is not the boring fallback; for this problem it's the right primary and semantic is the optional add. Test your specific corpus before you reach for embeddings.

submitted by /u/AbjectBug5885