r/LocalLLaMA · May 22, 2026 · 2 min read

How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model)

I'm building a local-first agent — a plain ReAct loop (think, pick a tool, observe, repeat) on a llama.cpp backend — and I want to be precise about a question that usually just gets answered with "it depends."

It does depend. So let me split it into two jobs:

(a) Heavy one-shot generation — write a 400-line module, refactor a big file. That wants a big model, no argument. In my setup I route this to a dedicated coding model; I don't ask the loop model to do it.

(b) The orchestration loop itself — read this, decide which tool, call it with the right arguments, look at the result, react. This post is only about (b).

For (b): how small can that model get before the loop stops being trustworthy? My balance point right now is Qwen3.6-35B-A3B (MoE, ~3B active) — the lightest setup where the loop holds up, still fine on a 12GB card with 30 expert offload (running 40 t/s prompt gen). Below that it degrades, and I've been trying to pin down what degrades first.

It isn't reasoning. It's tool-call discipline. The model gets the intent right and then botches the call. Examples from smaller models I tested:

- passes overwrite=true to an append_file tool that has no such parameter
- calls grep_search with an output_mode arg that doesn't exist; it generalized it from a different tool
- tries to invoke a conclusion "tool" that was never a tool, because finishing the task feels like an action
- passes overwrite again to yet another tool, having "learned" the wrong lesson from an earlier call

The common thread is over-generalized or invented parameters (and, in one case, an invented tool). The 35B-A3B does this rarely; small dense models do it constantly.
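To make the failure class concrete, here is a minimal sketch of a check that would reject each of the calls above before it reaches a tool. This is not something described in the post or taken from the repo: the tool stubs, the TOOLS registry, and check_call are all hypothetical names, and the only real API used is Python's inspect.signature.

```python
import inspect
from typing import Optional

# Hypothetical tool stubs; only their signatures matter for this sketch.
def append_file(path: str, text: str) -> str:
    return f"appended {len(text)} chars to {path}"

def grep_search(pattern: str, path: str = ".") -> str:
    return f"searched {path} for {pattern}"

# Hypothetical registry mapping tool names to functions.
TOOLS = {fn.__name__: fn for fn in (append_file, grep_search)}

def check_call(tool_name: str, args: dict) -> Optional[str]:
    """Return None if the call matches a real tool signature,
    otherwise an error message that can be fed back as the observation.
    Assumes tools take plain named parameters (no *args / **kwargs)."""
    fn = TOOLS.get(tool_name)
    if fn is None:
        return f"unknown tool '{tool_name}'; available: {', '.join(sorted(TOOLS))}"
    sig = inspect.signature(fn)
    # Parameters the model passed that the function does not accept.
    unknown = sorted(set(args) - set(sig.parameters))
    if unknown:
        return f"{tool_name} has no parameter(s) {unknown}; signature is {tool_name}{sig}"
    # Required parameters (no default) that the model left out.
    missing = [name for name, p in sig.parameters.items()
               if p.default is inspect.Parameter.empty and name not in args]
    if missing:
        return f"{tool_name} is missing required parameter(s) {missing}"
    return None

if __name__ == "__main__":
    print(check_call("append_file", {"path": "a.txt", "text": "hi", "overwrite": True}))
    print(check_call("conclusion", {}))
```

The design choice here is to return the error as text rather than raise, so the loop can hand it to the model as an ordinary observation and let it retry with the real parameter list in view.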

Two things I tried to push the floor lower:

- Exposing the exact tool signature in the system prompt: I generate tool_name(arg1, arg2, opt=default) straight from each function and list it next to the tool, so the model sees the precise parameter list and, by omission, which parameters do NOT exist. Subjectively it helped a lot; not measured rigorously yet.
- Repetition watchdogs: small models get stuck repeating the same failing (tool, args) call while the observation keeps erroring, because their model of the state has drifted. I fingerprint recent actions and inject a "stop, change strategy" hint after N identical failures. Works, but it's a band-aid.
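Both ideas fit in a few lines. The sketch below is an illustration under assumptions, not the code from the repo: render_signature, RepetitionWatchdog, the example grep_search stub, the hint wording, and the default thresholds are all made up here. It assumes tools are plain Python functions and that tool arguments are JSON-serializable.

```python
import inspect
import json
from collections import deque
from typing import Optional

def render_signature(fn) -> str:
    """Render a function as name(arg1, arg2, opt=default) for the system prompt."""
    parts = []
    for name, p in inspect.signature(fn).parameters.items():
        if p.default is inspect.Parameter.empty:
            parts.append(name)                      # required parameter
        else:
            parts.append(f"{name}={p.default!r}")   # optional, show its default
    return f"{fn.__name__}({', '.join(parts)})"

class RepetitionWatchdog:
    """Counts identical failing (tool, args) calls within a sliding window."""

    def __init__(self, max_repeats: int = 3, window: int = 8):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=window)  # fingerprints of recent failed calls

    def record(self, tool: str, args: dict, failed: bool) -> Optional[str]:
        """Record one step; return a hint string once the same failing call
        has been seen max_repeats times, otherwise None."""
        if not failed:
            return None
        # Canonical fingerprint: key order in args must not matter.
        fingerprint = (tool, json.dumps(args, sort_keys=True, default=str))
        self.recent.append(fingerprint)
        if self.recent.count(fingerprint) >= self.max_repeats:
            self.recent.clear()  # avoid repeating the hint on every later step
            return (f"Stop. You have called {tool} with the same arguments "
                    f"{self.max_repeats} times and it keeps failing. Change strategy.")
        return None

# Hypothetical example tool, used only to show the rendered signature.
def grep_search(pattern, path=".", max_results=50):
    return []

if __name__ == "__main__":
    print(render_signature(grep_search))
    dog = RepetitionWatchdog(max_repeats=3)
    for _ in range(3):
        hint = dog.record("read_file", {"path": "missing.txt"}, failed=True)
    print(hint)
```

In a loop, the hint would be appended to the next observation. Clearing the window after firing is one choice among several; escalating (for example, aborting the step after a second hint) would be a reasonable alternative.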

What I'm after:

- For the orchestration role specifically: what is the smallest model you actually trust in a loop?
- Is tool-call discipline the first thing that breaks for you too, or does something else go first?
- Are there better ways to make small models viable here, such as stricter tool schemas or light fine-tuning?

Repo's here if useful — still rough: https://github.com/homoagens/pragma

You can probably go smaller than people think — if you fix tool-call discipline instead of just reaching for a bigger model.

submitted by /u/HomoAgens1