r/LocalLLaMA · 1 min read

I benchmarked full tool catalog vs ranked catalog on a local model: 8% → 77% accuracy

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Been running agents locally for a while and kept hitting the same issue: the more tools I added, the worse the model got at picking the right one. So I finally benchmarked it properly.

Setup: qwen3.5-class model on an M4 MacBook, 100 tools in the catalog. One run passed the full catalog every turn; the other ranked the tools per query (BM25 over the plain-text tool descriptions) and passed only the relevant ones.

Results:

  • Full catalog: ~8% task accuracy
  • Ranked: ~77%
  • Tokens: -57%

Same weights, same machine, same prompts. The only difference was how many tool descriptions the model had to read past before choosing. At 20-30 tools it barely matters; past ~100 it falls apart. The model isn't getting dumber, it's just drowning.

The ranking is deliberately simple: no embeddings, no extra LLM call. It's part of an open source project (Ratel) I help build. The benchmark's here if you want to run it on your own setup: https://github.com/ratel-ai/ratel-bench
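For anyone who wants to try the idea without pulling in the project, here is a minimal sketch of BM25 ranking over tool descriptions. It is not Ratel's actual code: the function name `rank_tools`, the `top_k` cutoff, the tokenizer, and the `k1`/`b` defaults are all assumptions picked for illustration, and the tool catalog in the demo is made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit,
    # so "get_weather" becomes ["get", "weather"].
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_tools(query, tools, top_k=5, k1=1.5, b=0.75):
    """Return up to top_k tool names ranked by BM25 score against the query.

    tools maps tool name -> plain-text description (hypothetical shape).
    Tools that share no term with the query are dropped.
    """
    if not tools:
        return []

    # Each "document" is the tool name plus its description.
    docs = {name: tokenize(f"{name} {desc}") for name, desc in tools.items()}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs

    # Document frequency: in how many tools does each term appear?
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF (always positive) times length-normalized term frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score

    ranked = sorted(scores, key=lambda n: scores[n], reverse=True)
    return [n for n in ranked if scores[n] > 0][:top_k]


if __name__ == "__main__":
    # Made-up catalog, only to show the call shape.
    catalog = {
        "get_weather": "Fetch the current weather forecast for a city",
        "send_email": "Send an email message to a recipient",
        "read_file": "Read a file from disk",
    }
    print(rank_tools("what is the weather in Paris", catalog, top_k=2))
```

Only the names returned by `rank_tools` would then have their schemas included in the prompt for that turn. A real setup would probably also want a fallback for queries that match nothing, which this sketch leaves out.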

Anyone else seeing similar jumps (or different thresholds) with local models?

submitted by /u/AbjectBug5885