How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends."
It does depend. So let me split it into two jobs:
(a) Heavy one-shot generation — write a 400-line module, refactor a big file. That wants a big model, no argument. In my setup I route this to a dedicated coding model; I don't ask the loop model to do it.
(b) The orchestration loop itself — read this, decide which tool, call it with the right arguments, look at the result, react. This post is only about (b).
For (b): how small can that model get before the loop stops being trustworthy? My balance point right now is Qwen3.6-35B-A3B (MoE, ~3B active) — the lightest setup where the loop holds up, still fine on a 12GB card with 30 expert offload (running 40 t/s prompt gen). Below that it degrades, and I've been trying to pin down what degrades first.
It isn't reasoning. It's tool-call discipline. The model gets the intent right and then botches the call. Examples from smaller models I tested:
- passes `overwrite=true` to an `append_file` tool that has no such parameter
- calls `grep_search` with an `output_mode` arg that doesn't exist; it generalized it from a different tool
- tries to invoke a `conclusion` "tool" that was never a tool, because finishing the task feels like an action
- passes `overwrite` again to yet another tool, having "learned" the wrong lesson from an earlier call
Over-generalized or invented parameters. The 35B-A3B does this rarely; small dense models do it constantly.
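One way to catch this class of mistake before the tool runs is to compare the model's arguments against the real function signature and hand the mismatch back as the observation. A minimal sketch, assuming tools are plain Python functions with no `*args`/`**kwargs`; `append_file` is a stubbed stand-in and `check_call` is a hypothetical helper, not code from the repo:

```python
import inspect


# Hypothetical tool, stubbed for the example (it does not touch the disk).
def append_file(path: str, text: str) -> str:
    """Append text to a file."""
    return f"appended {len(text)} chars to {path}"


def check_call(func, args: dict) -> list[str]:
    """Return a list of problems with a proposed tool call; empty means OK.

    Assumes the tool takes only named parameters (no *args / **kwargs).
    """
    params = inspect.signature(func).parameters
    problems = []
    # Parameters the model invented or carried over from another tool.
    for name in args:
        if name not in params:
            problems.append(
                f"unknown parameter '{name}'; valid parameters: {', '.join(params)}"
            )
    # Required parameters (those without a default) that the model left out.
    for name, p in params.items():
        if p.default is inspect.Parameter.empty and name not in args:
            problems.append(f"missing required parameter '{name}'")
    return problems


# The first failure from the list above: overwrite passed to append_file.
call = {"path": "notes.txt", "text": "hi", "overwrite": True}
problems = check_call(append_file, call)
print(problems)
```

Feeding the returned strings back as the observation gives the model a concrete correction (the valid parameter names) rather than a raw traceback.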
Two things I tried to push the floor lower:
- Exposing the exact tool signature in the system prompt: `tool_name(arg1, arg2, opt=default)`, generated straight from the function, next to each tool, so the model sees the precise parameter list and, by omission, which parameters do NOT exist. Subjectively it helped a lot; not measured rigorously yet.
- Repetition watchdogs: small models get stuck repeating the same failing (tool, args) call while the observation keeps erroring; their model of the state has drifted. I fingerprint recent actions and inject a "stop, change strategy" hint after N identical failures. Works, but it's a band-aid.
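The signature idea can be sketched with Python's standard `inspect` module. This is an illustration under assumptions, not the repo's actual code: `grep_search` below is a stub with made-up parameters, and `render_signature` / `render_tool_list` are hypothetical helper names.

```python
import inspect


# Hypothetical tool with illustrative parameters; the body is a stub.
def grep_search(pattern: str, path: str = ".", ignore_case: bool = False) -> str:
    """Search files under path for a pattern."""
    return ""


def render_signature(func) -> str:
    """Render 'name(arg1, arg2, opt=default)' from the function itself."""
    parts = []
    for name, p in inspect.signature(func).parameters.items():
        if p.default is inspect.Parameter.empty:
            parts.append(name)  # required parameter
        else:
            parts.append(f"{name}={p.default!r}")  # optional, show its default
    return f"{func.__name__}({', '.join(parts)})"


def render_tool_list(tools) -> str:
    """One line per tool for the system prompt: signature plus docstring summary."""
    lines = []
    for func in tools:
        doc = (inspect.getdoc(func) or "").splitlines()
        summary = doc[0] if doc else ""
        lines.append(f"{render_signature(func)}: {summary}")
    return "\n".join(lines)


print(render_tool_list([grep_search]))
```

Because the text is derived from the function, the prompt cannot drift from the real parameter list when a tool changes.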
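A repetition watchdog along these lines might look like the sketch below. It makes assumptions the post does not spell out: "identical" means the same tool name and the same JSON-serialized arguments, only consecutive failures count, and any success resets the streak. The class name and hint wording are illustrative.

```python
import json
from collections import deque


class RepetitionWatchdog:
    """Flags N consecutive identical failing (tool, args) calls."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        # Keeps only the fingerprints of the last max_repeats failures.
        self.recent = deque(maxlen=max_repeats)

    @staticmethod
    def fingerprint(tool: str, args: dict) -> str:
        # sort_keys makes the fingerprint independent of argument order.
        return tool + ":" + json.dumps(args, sort_keys=True, default=str)

    def record(self, tool: str, args: dict, failed: bool):
        """Record one step; return a hint string when the loop looks stuck."""
        if not failed:
            self.recent.clear()  # a successful call resets the streak
            return None
        self.recent.append(self.fingerprint(tool, args))
        stuck = len(self.recent) == self.max_repeats and len(set(self.recent)) == 1
        if stuck:
            self.recent.clear()  # avoid re-firing on every following step
            return (
                f"The call {tool} with these arguments has failed "
                f"{self.max_repeats} times in a row. Stop and change strategy."
            )
        return None


wd = RepetitionWatchdog(max_repeats=3)
bad_args = {"path": "notes.txt", "overwrite": True}
hints = [wd.record("append_file", bad_args, failed=True) for _ in range(3)]
```

In the loop, a non-`None` return value would be appended to the next observation before the model's next turn.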
What I'm after:
- For the orchestration role specifically — smallest model you actually trust in a loop?
- Is tool-call discipline the first thing that breaks for you too, or does something else go first?
- Better ways to make small models viable here — stricter tool schemas, light fine-tuning?
Repo's here if useful — still rough: https://github.com/homoagens/pragma
You can probably go smaller than people think — if you fix tool-call discipline instead of just reaching for a bigger model.