An agent that plans with a frontier model but runs most of its tokens locally (built it for my own dual-3090 rig)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
For the past couple of months, I've been building a tool for my personal use. I have a dual RTX 3090 system that I wanted to put to use, but Qwen 3.5/3.6 27B and Gemma 4 31B, while really good, just don't have the taste or the ability of a frontier model.
OTOH, frontier models are expensive and I didn't want everything I do running through them. I wanted the best of both worlds: frontier reasoning for the plan, local models doing almost all the actual work.
I have tried a few repos that let small models punch above their weight by 'calling' frontier models, but that's not what I wanted. I want to plan with the frontier model, because my experience in software engineering over the last decade+ has taught me that design is the bottleneck in most projects, and getting the design right up front prevents spaghetti code and rewrites.
So I built an agent. It took a lot of iterations, but I now have one that works, and I'm using it for my own work.
Here's the crux of the agent. It leans on a lot of existing tools (no reinventing the wheel), but it's all customizable.
3 tiers, all swappable via a config file:
- Planner: Codex (extremely powerful; though anything that emits the decision JSON works here)
- Local: Qwen 3.6 27B (great for agentic use and tool calling, good enough for coding)
- Senior (optional): Kimi K2.6 via opencode-go (steps in when the local model fails and its retry attempts are exhausted)
You can run all 3 tiers locally, 2 tiers locally, one frontier and one local, or any other combination. This is just what I found to work best.
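A minimal sketch of how three swappable tiers and the escalation rule could be modeled. The post does not show the real config format, so the class names, the `worker_for` helper, the `max_retries` parameter, and the model label strings below are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    """One swappable tier. Field names are illustrative, not the tool's schema."""
    name: str
    model: str
    is_local: bool


@dataclass(frozen=True)
class TierConfig:
    planner: Tier
    local: Tier
    senior: Optional[Tier] = None  # the senior tier is optional

    def worker_for(self, failed_attempts: int, max_retries: int) -> Tier:
        """Pick who does the work: the local tier until retries run out,
        then the senior tier if one is configured."""
        if failed_attempts >= max_retries and self.senior is not None:
            return self.senior
        return self.local


# The setup described in the post, with made-up model label strings.
config = TierConfig(
    planner=Tier("planner", "codex", is_local=False),
    local=Tier("local", "qwen-3.6-27b", is_local=True),
    senior=Tier("senior", "kimi-k2.6", is_local=False),
)

# A two-tier variant with no senior: the local tier keeps the task.
two_tier = TierConfig(planner=config.planner, local=config.local)
```

With this shape, swapping a tier is a one-line change, and leaving out the senior tier simply means failed tasks stay with the local model.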
Every task goes to Codex, which can map it to N phases. A big coding task, say, will usually map to 3 phases (research, implement, review).
Similarly, a review task also gets split into phases (review, artifact).
Each phase can also grind for multiple epochs. Each epoch hands out tasks that the local models do (and do very well), and all of this is planned by Codex.
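The phase and epoch structure described above can be sketched as a nested loop. This is only an illustration under assumptions: `run_job`, `stub_planner`, `stub_worker`, the `max_epochs` cap, and the "empty task list means the phase is finished" convention are hypothetical, since the post does not show the real planner or worker interfaces.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures standing in for the planner and the local worker.
PlanEpoch = Callable[[str, int], List[str]]  # (phase, epoch) -> task descriptions
RunTask = Callable[[str], str]               # task description -> result summary


def run_job(
    phases: List[str],
    plan_epoch: PlanEpoch,
    run_task: RunTask,
    max_epochs: int = 3,
) -> List[Tuple[str, int, str, str]]:
    """Walk each phase, let the planner hand out tasks per epoch,
    and have the worker execute them."""
    history = []
    for phase in phases:
        for epoch in range(max_epochs):
            tasks = plan_epoch(phase, epoch)
            if not tasks:  # assumed convention: no tasks means the phase is done
                break
            for task in tasks:
                history.append((phase, epoch, task, run_task(task)))
    return history


def stub_planner(phase: str, epoch: int) -> List[str]:
    # Stand-in for the frontier planner: two tasks, then one follow-up, then done.
    if epoch == 0:
        return [f"{phase}: task A", f"{phase}: task B"]
    if epoch == 1:
        return [f"{phase}: follow-up"]
    return []


def stub_worker(task: str) -> str:
    # Stand-in for a local model executing the task.
    return f"ran {task}"


history = run_job(["research", "implement", "review"], stub_planner, stub_worker)
```

The point of the split is that only the planner call sits on the frontier model, while every `run_task` call stays local.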
The biggest differentiator is deterministic validation. A task only counts as done when a check actually passes, i.e. a command exits 0 or the file it was supposed to produce exists. The state machine re-runs those checks itself instead of trusting what the model says it did, so a multi-hour chain can't drift by claiming progress it never made.
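A minimal sketch of that kind of check, assuming a task carries either a command to run or an expected artifact path. The `Check` class and `resolve_state` function are hypothetical names, not the tool's actual API. In this sketch, if both a command and an artifact are set, both must pass.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class Check:
    """A deterministic completion check: a command, an artifact, or both."""
    command: Optional[List[str]] = None
    artifact: Optional[Path] = None

    def passes(self) -> bool:
        if self.command is None and self.artifact is None:
            return False  # nothing to verify, so nothing can be proven done
        if self.command is not None:
            result = subprocess.run(self.command, capture_output=True)
            if result.returncode != 0:
                return False
        if self.artifact is not None and not self.artifact.exists():
            return False
        return True


def resolve_state(model_claims_done: bool, check: Check) -> str:
    """The model's claim is deliberately ignored; only the check decides."""
    return "done" if check.passes() else "pending"


# Usage: commands that exit 0 and 1, plus an artifact that was never written.
ok_cmd = Check(command=[sys.executable, "-c", "raise SystemExit(0)"])
bad_cmd = Check(command=[sys.executable, "-c", "raise SystemExit(1)"])
workdir = Path(tempfile.mkdtemp())
missing_file = Check(artifact=workdir / "report.md")

print(resolve_state(True, bad_cmd))        # model says done, check disagrees
print(resolve_state(False, ok_cmd))        # check passes regardless of the claim
print(resolve_state(True, missing_file))   # expected file does not exist
```

Because the state only advances on an exit code or a file on disk, a model that reports success without doing the work just leaves the task pending.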
I've found that this can enable local models to be much more capable than otherwise:
- Enables them to do tasks which span hours and hours
- Taste and capability of a frontier model, but ~85-90% of all tokens (by my measurement) go through local models. For output tokens it's ~95%.
- Context isolation: it prevents context rot, and the frontier model gets much cheaper because its context window doesn't fill up with bash calls.
- Also does some useful stuff by default: uses a repomapper to map the repo as a graph, and curates context fairly aggressively so the local models aren't drowning in irrelevant files.
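To make the two percentages above concrete, here is a small sketch of how a local-token share could be computed from per-tier counters. The `Usage` class and the token counts are invented for illustration and are not the author's measurements. They are only chosen so that the results land in the ranges quoted above.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Token counters for one tier. Hypothetical structure."""
    input_tokens: int
    output_tokens: int

    @property
    def total(self) -> int:
        return self.input_tokens + self.output_tokens


def local_share_all(local: Usage, frontier: Usage) -> float:
    """Fraction of all tokens (input + output) handled by local models."""
    return local.total / (local.total + frontier.total)


def local_share_output(local: Usage, frontier: Usage) -> float:
    """Fraction of output tokens generated by local models."""
    return local.output_tokens / (local.output_tokens + frontier.output_tokens)


# Made-up counts for one job, NOT measured data.
local = Usage(input_tokens=800_000, output_tokens=190_000)
frontier = Usage(input_tokens=140_000, output_tokens=10_000)

print(f"all tokens via local:    {local_share_all(local, frontier):.1%}")
print(f"output tokens via local: {local_share_output(local, frontier):.1%}")
```

The output share comes out higher than the overall share because the planner mostly reads context and emits short plans, while the local models generate the bulk of the code and tool calls.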
It's still WIP, but it's finally at a stage where it's usable, so I was wondering if y'all would like to try it (repo in first comment).
Things that are messy:
- Installation: not very clean. I rely on a bunch of existing open source software like pi, opencode, etc.
- No UI: it's just a shell command with a simple TUI showing status updates. You need to create your own job.md file (or have an agent create one).