r/LocalLLaMA

We have sub-agents at home

At work I get unfettered access to GPT 5.4 and Sonnet, so I'm quite used to spawning sub-agents to go crazy on a repo and split up tasks.

At home I am VRAM poor and like to run models locally for my own enjoyment. Almost no sub-agent extension or implementation accounts for the restrictions imposed by having 10 GB of VRAM and a single slot for a KV cache (that's already quantized).

I already work as a developer, so qwen3.6-35b-a3b and I tag-teamed a partially vibe-coded fork of an existing sub-agent repository for pi coding agent.

This is really only relevant if you:

  • Use pi coding agent as your harness
  • Can only run a single LLM at a time with 1 slot via llama.cpp server
  • Want to use sub-agents without fully reprocessing your prompts after the sub-agent is done
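For anyone wondering why the last point matters with one slot: as far as I understand it, llama.cpp server keeps the slot's cached tokens and only reprocesses the part of a new prompt that differs from the cached prefix. Below is a toy sketch of that arithmetic, not code from the fork. The token ids and the helper names (`common_prefix_len`, `tokens_to_reprocess`) are made up for illustration, and it assumes the server reuses the longest common prefix.

```python
def common_prefix_len(cached, prompt):
    """Count the leading tokens shared by the cached sequence and the new prompt."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def tokens_to_reprocess(cached, prompt):
    """Tokens that still need prompt processing if the shared prefix is reused."""
    return len(prompt) - common_prefix_len(cached, prompt)


# Stand-in token ids (hypothetical): a 1000-token main conversation.
main = list(range(1000))
sub_task = [5001, 5002, 5003]   # tokens the sub-agent appends
sub_result = [7001, 7002]       # summary handed back to the main agent

# Case 1: the sub-agent continues from the main context,
# so the single slot ends up holding main + sub_task.
cached_after_forked_sub = main + sub_task

# Case 2: the sub-agent starts from an unrelated prompt,
# so the slot no longer starts with the main conversation.
cached_after_fresh_sub = [9001, 9002] + sub_task

# The main agent resumes with its old context plus the sub-agent's result.
next_main_prompt = main + sub_result

forked_cost = tokens_to_reprocess(cached_after_forked_sub, next_main_prompt)
fresh_cost = tokens_to_reprocess(cached_after_fresh_sub, next_main_prompt)

print(forked_cost, fresh_cost)
```

In the first case only the new tail gets processed; in the second the whole main context does, which is what hurts at 200-300 t/s prompt processing on a long context.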

Repo is here, feel free to use it or fork it idc. I am also interested in how others around here have dealt with sub-agents on a purely local and VRAM constrained setup. I was also planning to add the ability for sub-agents to be spawned with no previous context, and to manage saving and restoring the main context via `--slot-save-path` and the `slots` endpoint. But the `.bin` files produced from that are pretty fat lol
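A rough sketch of what that slot save/restore round trip could look like from the client side. This is not from the repo. It assumes llama.cpp server was started with `--slot-save-path` pointing at a writable directory and that it exposes `POST /slots/{id}?action=save` and `?action=restore` with a JSON `filename` body. The base URL, slot id, and file name are placeholders, and `build_slot_request` / `slot_action` are hypothetical helper names.

```python
import json
import urllib.request


def build_slot_request(base_url, slot_id, action, filename):
    """Build the POST request that asks the server to save or restore a slot."""
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported slot action: {action}")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def slot_action(base_url, slot_id, action, filename, timeout=120):
    """Send the request and return the server's JSON reply as a dict."""
    req = build_slot_request(base_url, slot_id, action, filename)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The idea would be to call `slot_action("http://127.0.0.1:8080", 0, "save", "main.bin")` before handing the slot to a fresh sub-agent, then the same call with `"restore"` once it finishes, so the main context comes back without reprocessing. The file is written under the `--slot-save-path` directory, and that file is the fat `.bin` mentioned above.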

Last thing, I've really been enjoying MTP in the main llama.cpp branch and have been getting pretty solid performance from the Apex Qwen variant. Able to run at 175-200k context with q8 KV. Getting 200-300 t/s prompt processing and 25-40 t/s generation depending on draft hit rates.

submitted by /u/sisyphus-cycle