How I'm handling per-agent isolation and environment lifecycle in a harness-agnostic orchestration library
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
This is my third post about designing an orchestration library for agents. I want to share the architecture decisions as I go and to put a solution out there in case you have the same problem, but also to hear what you think.
- Agent's environment: workspace, runtime, and directories
- Configuration files
- Environment Lifecycle
This post is about the lifecycle of an agent's environment, which is something that often gets overlooked, or simplified down to a workspace plus a thread.
So, I wanted to support multiple environments and runtimes, which meant I needed a way to abstract that. I came up with what I defined in the first post:
- workspace: ensures there's a place for the agent to work
- runtime: ensures there's an environment the agent can run in
So an agent has a workspace, which has to be provisioned (provision), and a runtime, which has to be started (start). These steps are naturally sequential, and they give you four states:
- not-provisioned: no workspace. Two ways to be here:
  - never provisioned (no DB record, no letter): the agent doesn't exist as an entity yet, it's just config text.
  - previously provisioned, then unprovisioned (record + letter retained): the workspace is gone but the identity stays.
- provisioned: workspace and git branch exist on disk. No runtime.
- started: runtime is "up" in the runtime layer's sense (which differs by runtime). Token issued. Can receive messages. This is when the agent runs. Note that this state, in its purest form, doesn't actually know whether the agent is running (see Note about the agent itself).
- retired: permanently decommissioned. DB record + letter kept forever (the event log always maps a letter to one agent; letters are never reused).
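To make the four states concrete, here is a minimal sketch of them as a transition table. The names `State`, `TRANSITIONS`, and `apply` are hypothetical, and the exact set of allowed transitions is my assumption (in particular, that retire is allowed from any non-retired state), not something taken from the library.

```python
from enum import Enum


class State(Enum):
    NOT_PROVISIONED = "not-provisioned"
    PROVISIONED = "provisioned"
    STARTED = "started"
    RETIRED = "retired"


# (current state, command) -> next state. Assumed transition set, for illustration only.
TRANSITIONS = {
    (State.NOT_PROVISIONED, "provision"): State.PROVISIONED,
    (State.PROVISIONED, "provision"): State.PROVISIONED,  # re-running provision is safe
    (State.PROVISIONED, "start"): State.STARTED,
    (State.STARTED, "stop"): State.PROVISIONED,
    (State.PROVISIONED, "unprovision"): State.NOT_PROVISIONED,
    # Assumption: retire is allowed from any non-retired state.
    (State.NOT_PROVISIONED, "retire"): State.RETIRED,
    (State.PROVISIONED, "retire"): State.RETIRED,
    (State.STARTED, "retire"): State.RETIRED,
}


def apply(state: State, command: str) -> State:
    """Return the next state, or raise if the command is not valid in this state."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"cannot {command} an agent in state {state.value}") from None
```

Nothing leads out of retired, which matches "permanently decommissioned": any command applied to a retired agent raises.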
The important part is that workspace provisioning and the runtime are each behind an interface, and every implementation knows how to start itself, check if it's running, provision itself, and so on. The lifecycle logic doesn't care which one it's talking to.
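As a rough illustration of what "behind an interface" could look like, here is a sketch using `typing.Protocol`. The method names (`provision`, `is_provisioned`, `start`, `stop`, `is_running`), the `bring_up` helper, and the in-memory implementations are all hypothetical and stand in for whatever the library actually defines.

```python
from typing import Protocol


class Workspace(Protocol):
    def provision(self) -> None: ...
    def is_provisioned(self) -> bool: ...


class Runtime(Protocol):
    def start(self) -> None: ...
    def stop(self) -> None: ...
    def is_running(self) -> bool: ...


def bring_up(workspace: Workspace, runtime: Runtime) -> None:
    """Lifecycle logic: provision, then start, using only the interfaces."""
    if not workspace.is_provisioned():
        workspace.provision()
    if not runtime.is_running():
        runtime.start()


class InMemoryWorkspace:
    """Toy workspace used only to exercise bring_up."""

    def __init__(self) -> None:
        self.provisioned = False

    def provision(self) -> None:
        self.provisioned = True

    def is_provisioned(self) -> bool:
        return self.provisioned


class InMemoryRuntime:
    """Toy runtime that counts how many times it was started."""

    def __init__(self) -> None:
        self.running = False
        self.starts = 0

    def start(self) -> None:
        self.running = True
        self.starts += 1

    def stop(self) -> None:
        self.running = False

    def is_running(self) -> bool:
        return self.running
```

`bring_up` never inspects the concrete type, so swapping in a different workspace or runtime implementation does not touch the lifecycle code.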
Note:
- start/stop mean different things per runtime
- provision is runtime-agnostic
I decided that agents are created at provisioning. There's no separate "create" command. A permanent agent declared in agents.yaml is just config text until provision runs; that's the act that creates the DB record, allocates its letter, and builds the environment.
Reconciliation commands: sync and ensure
- sync: reconcile DB downward to match reality
- ensure: bring agents upward to a per-agent floor (not a target) declared in agents.yaml

```yaml
agents:
  atlas:
    ensure: started      # provision + start if needed
  backend:
    ensure: provisioned  # provision only, don't start runtime
```
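The difference between the two directions can be sketched in a few lines. This is only an illustration: `ORDER`, `STEPS`, `ensure_plan`, and `sync_state` are hypothetical names, the two boolean probes passed to `sync_state` are an assumed way of observing reality, and retired agents are left out of the sketch entirely.

```python
# States in ascending order; retired agents are out of scope for this sketch.
ORDER = ["not-provisioned", "provisioned", "started"]
# Command that moves an agent up into each state.
STEPS = {"provisioned": "provision", "started": "start"}


def ensure_plan(current: str, floor: str) -> list[str]:
    """Commands needed to lift `current` up to at least `floor`."""
    cur, target = ORDER.index(current), ORDER.index(floor)
    # A floor, not a target: an agent already at or above it yields an empty plan.
    return [STEPS[state] for state in ORDER[cur + 1 : target + 1]]


def sync_state(recorded: str, workspace_exists: bool, runtime_running: bool) -> str:
    """Lower the recorded state to what reality supports; never raise it."""
    if not workspace_exists:
        observed = "not-provisioned"
    elif not runtime_running:
        observed = "provisioned"
    else:
        observed = "started"
    return min(recorded, observed, key=ORDER.index)
```

With the YAML above, `ensure_plan("not-provisioned", "started")` gives `["provision", "start"]` for atlas, while a backend that is already started is left alone because its floor is only provisioned.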
Notes on provision and idempotency
Provision is idempotent and doubles as the repair operation. Every step is "ensure" / create-if-missing: ensure workspace, ensure branch, ensure artifacts/secrets dirs, run on_provision. Consequences:
- A deleted workspace is restored by re-running provision
- A crash mid-provision is fixed by re-running it
- Never clobber what's present: a workspace that exists is left alone; only a missing one is recreated. This keeps re-provision safe to run anytime.
A re-provision of a previously-provisioned agent reuses its existing record + letter.
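The create-if-missing idea can be sketched for the directory steps alone. The directory layout (`<root>/<agent>/artifacts` and `<root>/<agent>/secrets`) and the `provision` signature are assumptions for illustration; the branch step and the on_provision hook are omitted.

```python
from pathlib import Path


def provision(root: Path, agent: str) -> list[str]:
    """Create whatever is missing under root/agent; return what was created."""
    created = []
    workspace = root / agent
    # Assumed layout: the workspace itself plus artifacts/ and secrets/ inside it.
    for directory in (workspace, workspace / "artifacts", workspace / "secrets"):
        if not directory.exists():
            # Only missing directories are created; existing content is never touched.
            directory.mkdir(parents=True)
            created.append(str(directory.relative_to(root)))
    return created
```

Running it twice in a row returns an empty list the second time, and running it after one directory was deleted recreates only that directory, which is what makes the same operation usable as a repair.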
Commands table
| Command | Notes |
|---|---|
| provision | handles retry/duplicate |
| unprovision | --remove-branch, --remove-artifacts, --remove-secrets |
| start | loads agents.yaml for config |
| stop | no yaml needed |
| retire | no yaml needed |
| sync | yaml optional; downward only |
| ensure | requires yaml; upward to floor |
| promote | ephemeral → permanent; writes yaml (only programmatic yaml write) |
Letter
Provision is the creation event. A permanent agent defined in agents.yaml is just config text until provision runs; there is no separate create command. Provision creates the DB record, allocates the letter, and builds the environment.
- A never-provisioned agent (YAML only) has no record and no letter.
- Once provisioned, the letter persists through unprovision, re-provision, and retire. It is never released once allocated (the event log must map a letter to one agent forever).
So unprovision returns an agent to not-provisioned with its record + letter retained, and re-provision reuses that same identity.
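The allocation rule can be sketched as a registry that has no release operation at all. Everything here is an assumption for illustration: the `LetterRegistry` class, the in-memory dict standing in for the DB record, and the naming scheme (A to Z, then AA, AB, and so on).

```python
import itertools
import string


def letter_sequence():
    """Yield A..Z, then AA, AB, ... (assumed naming scheme)."""
    for size in itertools.count(1):
        for combo in itertools.product(string.ascii_uppercase, repeat=size):
            yield "".join(combo)


class LetterRegistry:
    """Maps agents to letters; deliberately has no way to release a letter."""

    def __init__(self) -> None:
        self._letters = letter_sequence()
        self._by_agent: dict[str, str] = {}  # stand-in for the DB record

    def letter_for(self, agent: str) -> str:
        """Allocate on first provision; return the same letter ever after."""
        if agent not in self._by_agent:
            self._by_agent[agent] = next(self._letters)
        return self._by_agent[agent]
```

Because the generator only moves forward and the mapping is never deleted, unprovision, re-provision, and retire cannot cause a letter to point at a second agent.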
On host vs docker
This is more of an implementation detail than a core part of the design, but start and stop mean different things depending on the runtime, because host has no persistent runtime process and docker does. On docker, start is a docker run and the container becomes the persistent thing; on host, start mostly just issues the token and sets the new state.
This means that on host, "is it running?" just returns true, because there's no process to check. So host started is really just a bookkeeping claim (the token was issued).
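A sketch of how the two runtimes might differ. The class names, the `token_issued` flag, and the injectable `run` parameter are hypothetical, and a real container would need more than a name and an image (command, mounts, environment). The docker CLI calls used are `docker run -d --name` and `docker inspect -f '{{.State.Running}}'`.

```python
import subprocess


class HostRuntime:
    """No persistent process: 'started' is only a bookkeeping claim."""

    def __init__(self) -> None:
        self.token_issued = False

    def start(self) -> None:
        self.token_issued = True  # stand-in for issuing the real token

    def is_running(self) -> bool:
        return True  # there is no process to probe on host


class DockerRuntime:
    """The container is the persistent thing, so it can actually be checked."""

    def __init__(self, name: str, image: str, run=subprocess.run) -> None:
        # `run` is injectable so the class can be exercised without docker installed.
        self.name, self.image, self._run = name, image, run

    def start(self) -> None:
        self._run(["docker", "run", "-d", "--name", self.name, self.image], check=True)

    def is_running(self) -> bool:
        result = self._run(
            ["docker", "inspect", "-f", "{{.State.Running}}", self.name],
            capture_output=True,
            text=True,
        )
        # A missing container makes `docker inspect` exit non-zero.
        return result.returncode == 0 and result.stdout.strip() == "true"
```

Both classes answer the same question through the same method, but only the docker answer reflects something observable.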
Note about the agent itself
This is something I struggled with, but I came to the following realization:
The agent itself (i.e. the LLM or harness that actually does things) is only a subprocess, so it does not really have a lifecycle. It is working or it isn't.
I did consider a substate of the started state for this, but it is not the environment's concern.
There is a lot to say about the agent itself, and it may seem like I'm ignoring it, but it will become a central topic later on. I am setting up everything around it first.
Note also that I am not trying to replace existing harnesses. opencode, claude code, etc. all work pretty well, and it would be hard to make something even on par with them. Some already support remote control, sub-agents, etc.
The point is to make a library that makes it easy to orchestrate agents, is harness-agnostic, and even allows custom endpoints and running local models (a problem I already have a draft for). To the library, all of these are the same as running claude code: an agent that you can talk to, make do things, and have communicate with other agents.
The next post is about skills. They've become pretty universal, so I want to support them, but I don't like the current, very liberal approach, which I think carries real security risks. Follow me if you want to know when it's up.