r/LocalLLaMA · June 25, 2026 · 5 min read

How I'm handling per-agent isolation and environment lifecycle in a harness-agnostic orchestration library


This is my third post about designing an orchestration library for agents. I want to share the architecture decisions as I go and to put a solution out there in case you have the same problem, but also to hear what you think.

Agent's environment: workspace, runtime, and directories
Configuration files
Environment Lifecycle

This post is about the lifecycle of an agent's environment, which is something that often gets overlooked, or simplified down to a workspace plus a thread.

So, I wanted to support multiple environments and runtimes, which meant I needed a way to abstract that. I came up with what I defined in the first post:

workspace: ensures there's a place for the agent to work
runtime: ensures there's an environment the agent can run in

So an agent has a workspace, which has to be provisioned (provision), and a runtime, which has to be started (start). These steps are naturally sequential, and they give you four states:

not-provisioned: no workspace. Two ways to be here:
- never provisioned (no DB record, no letter): the agent doesn't exist as an entity yet, it's just config text.
- previously provisioned, then unprovisioned (record + letter retained): the workspace is gone but the identity stays.
provisioned: workspace and git branch exist on disk. No runtime.
started: runtime is "up" in the runtime layer's sense (which differs by runtime). Token issued. Can receive messages. This is when the agent runs. Note that this state, in its purest form, doesn't actually know whether the agent is running (see Note about the agent itself).
retired: permanently decommissioned. DB record + letter kept forever (the event log always maps a letter to one agent; letters are never reused).
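Here's a minimal sketch of the four states as a transition table. The state names come from the list above; the `TRANSITIONS` table and `apply` helper are hypothetical, and two details are assumptions the post doesn't spell out: that `stop` lands back in provisioned, and that `retire` is allowed from any of the other three states.

```python
from enum import Enum


class State(Enum):
    NOT_PROVISIONED = "not-provisioned"
    PROVISIONED = "provisioned"
    STARTED = "started"
    RETIRED = "retired"


# Hypothetical table: command -> {state before: state after}.
# Assumptions: `stop` returns to provisioned; `retire` works from any live state.
TRANSITIONS = {
    "provision": {
        State.NOT_PROVISIONED: State.PROVISIONED,
        State.PROVISIONED: State.PROVISIONED,  # re-running provision is a no-op/repair
    },
    "start": {State.PROVISIONED: State.STARTED},
    "stop": {State.STARTED: State.PROVISIONED},
    "unprovision": {State.PROVISIONED: State.NOT_PROVISIONED},
    "retire": {
        State.NOT_PROVISIONED: State.RETIRED,
        State.PROVISIONED: State.RETIRED,
        State.STARTED: State.RETIRED,
    },
}


def apply(state: State, command: str) -> State:
    """Return the state after `command`, or raise if the move is not allowed."""
    try:
        return TRANSITIONS[command][state]
    except KeyError:
        raise ValueError(f"cannot {command} from {state.value}") from None
```

Nothing leaves `RETIRED` in this table, which matches "permanently decommissioned".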

The important part is that workspace and runtime are each behind an interface, and every implementation knows how to provision itself, start itself, check if it's running, and so on. The lifecycle logic doesn't care which one it's talking to.

Note:

start/stop mean different things per runtime
provision is runtime-agnostic.
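To make the interface idea concrete, here is a rough sketch using `typing.Protocol`. The method names (`provision`, `is_provisioned`, `start`, `stop`, `is_running`), the `bring_up` helper, and the in-memory classes are all made up for illustration; the library's real interfaces may look different.

```python
from typing import Protocol


class Workspace(Protocol):
    def provision(self) -> None: ...
    def is_provisioned(self) -> bool: ...


class Runtime(Protocol):
    def start(self) -> None: ...
    def stop(self) -> None: ...
    def is_running(self) -> bool: ...


def bring_up(workspace: Workspace, runtime: Runtime) -> None:
    """Provision, then start. Works with any implementation of the two protocols."""
    if not workspace.is_provisioned():
        workspace.provision()
    if not runtime.is_running():
        runtime.start()


class InMemoryWorkspace:
    """Toy workspace used only to exercise bring_up."""

    def __init__(self) -> None:
        self.ready = False

    def provision(self) -> None:
        self.ready = True

    def is_provisioned(self) -> bool:
        return self.ready


class InMemoryRuntime:
    """Toy runtime that counts how many times it was started."""

    def __init__(self) -> None:
        self.up = False
        self.starts = 0

    def start(self) -> None:
        self.up = True
        self.starts += 1

    def stop(self) -> None:
        self.up = False

    def is_running(self) -> bool:
        return self.up
```

`bring_up` never inspects which concrete class it was handed, so the same lifecycle code would serve a host or docker runtime.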

I decided that agents are created at provisioning. There's no separate "create" command. A permanent agent declared in agents.yaml is just config text until provision runs; that's the act that creates the DB record, allocates its letter, and builds the environment.

Reconciliation commands: sync and ensure

sync: reconcile DB downward to match reality
ensure: bring agents upward to a per-agent floor (not a target) declared in agents.yaml

agents:
  atlas:
    ensure: started      # provision + start if needed
  backend:
    ensure: provisioned  # provision only, don't start runtime
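The "upward to a floor" versus "downward to reality" split can be sketched with two small functions. This is only an illustration: `ORDER`, `ensure_plan`, and `sync_state` are hypothetical names, and the two booleans passed to `sync_state` stand in for whatever checks the workspace and runtime layers actually perform.

```python
# Assumed ordering of the non-terminal states, lowest to highest.
ORDER = ["not-provisioned", "provisioned", "started"]


def ensure_plan(current: str, floor: str) -> list[str]:
    """Commands needed to raise `current` to at least `floor`; never lowers."""
    steps = ["provision", "start"]  # steps[i] moves ORDER[i] -> ORDER[i + 1]
    lo, hi = ORDER.index(current), ORDER.index(floor)
    return steps[lo:hi]  # empty when current is already at or above the floor


def sync_state(recorded: str, workspace_exists: bool, runtime_up: bool) -> str:
    """Lower the recorded state to what is observed; never raises it."""
    if recorded == "retired":
        return recorded  # retired is terminal, nothing to reconcile
    observed = "not-provisioned"
    if workspace_exists:
        observed = "started" if runtime_up else "provisioned"
    # Take whichever of the two sits lower in ORDER.
    return min(recorded, observed, key=ORDER.index)
```

An agent that is already started with a floor of provisioned yields an empty plan, which is the difference between a floor and a target.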

Notes on provision and idempotency

Provision is idempotent and doubles as the repair operation. Every step is "ensure" / create-if-missing: ensure workspace, ensure branch, ensure artifacts/secrets dirs, run on_provision. Consequences:

A deleted workspace is restored by re-running provision
A crash mid-provision is fixed by re-running it
Never clobber what's present: a workspace that exists is left alone; only a missing one is recreated. This keeps re-provision safe to run anytime.

A re-provision of a previously-provisioned agent reuses its existing record + letter.
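A create-if-missing provision step could look roughly like this. The directory layout (`workspace`, `artifacts`, `secrets` under one folder per agent) is an assumption for the sketch, and the git branch and `on_provision` hook are left out to keep it short.

```python
from pathlib import Path


def provision(root: Path, agent: str) -> list[str]:
    """Create whatever is missing under root/agent and return what was created."""
    created = []
    # Hypothetical layout: one directory per agent with three subdirectories.
    for sub in ("workspace", "artifacts", "secrets"):
        path = root / agent / sub
        if not path.exists():
            path.mkdir(parents=True)
            created.append(sub)
        # An existing directory is left untouched, contents included.
    return created
```

Running it twice in a row returns an empty list the second time, and deleting one directory then re-running recreates only that directory.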

Commands table

Command	Notes
`provision`	handles retry/duplicate
`unprovision`	`--remove-branch`, `--remove-artifacts`, `--remove-secrets`
`start`	loads agents.yaml for config
`stop`	no yaml needed
`retire`	no yaml needed
`sync`	yaml optional; downward only
`ensure`	requires yaml; upward to floor
`promote`	ephemeral → permanent; writes yaml (only programmatic yaml write)

Letter

Provision is the creation event. A permanent agent defined in agents.yaml is just config text until provision runs; there is no separate create command. Provision creates the DB record, allocates the letter, and builds the environment.

A never-provisioned agent (YAML only) has no record and no letter.
Once provisioned, the letter persists through unprovision, re-provision, and retire. It is never released once allocated (the event log must map a letter to one agent forever).

So unprovision returns an agent to not-provisioned with its record + letter retained, and re-provision reuses that same identity.
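The letter rules can be illustrated with a tiny in-memory registry. This is a sketch under assumptions: the post doesn't say what a letter looks like, so single characters from A to Z are assumed here, and `LetterRegistry` with its methods is a hypothetical name, not the library's API. What happens when the pool runs out isn't covered in the post either.

```python
import string
from typing import Optional


class LetterRegistry:
    """Allocates one letter per agent and never hands the same letter out twice."""

    def __init__(self) -> None:
        self._by_agent: dict[str, str] = {}  # agent name -> letter, kept forever
        self._pool = iter(string.ascii_uppercase)  # assumed alphabet: A..Z

    def provision(self, agent: str) -> str:
        # Re-provision reuses the existing identity.
        if agent in self._by_agent:
            return self._by_agent[agent]
        letter = next(self._pool, None)
        if letter is None:
            raise RuntimeError("letter pool exhausted")
        self._by_agent[agent] = letter
        return letter

    def lookup(self, agent: str) -> Optional[str]:
        """Return the agent's letter, or None if it was never provisioned."""
        return self._by_agent.get(agent)
```

There is deliberately no method that removes an entry: unprovision and retire would change the agent's state elsewhere but leave this mapping alone.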

On host vs docker

This is more of an implementation detail than a core part of the design, but start and stop mean different things depending on the runtime, because host has no persistent runtime process and docker does. On docker, start is a docker run and the container becomes the persistent thing; on host, start mostly just issues the token and sets the new state.

This means that on host, "is it running?" just returns true, because there's no process to check. In other words, host started is really just a bookkeeping claim (the token was issued).
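As a rough sketch of that difference, here are two runtime classes with the same surface. The class names, the `token_issued` flag, and the injectable `run` parameter are assumptions for illustration; the docker side uses the standard `docker run -d --name` and `docker inspect -f '{{.State.Running}}'` commands, but the flags the library really passes are not stated in the post.

```python
import subprocess


class HostRuntime:
    """No persistent process: 'started' only records that a token was issued."""

    def __init__(self) -> None:
        self.token_issued = False

    def start(self) -> None:
        self.token_issued = True  # bookkeeping only, nothing is launched

    def is_running(self) -> bool:
        return True  # there is no process to check on host


class DockerRuntime:
    """The container is the persistent thing, so it can actually be inspected."""

    def __init__(self, name: str, image: str, run=subprocess.run) -> None:
        # `run` is injectable so the class can be exercised without docker.
        self.name, self.image, self._run = name, image, run

    def start(self) -> None:
        self._run(
            ["docker", "run", "-d", "--name", self.name, self.image],
            check=True,
        )

    def is_running(self) -> bool:
        result = self._run(
            ["docker", "inspect", "-f", "{{.State.Running}}", self.name],
            capture_output=True,
            text=True,
        )
        return result.returncode == 0 and result.stdout.strip() == "true"
```

The lifecycle code calls `start()` and `is_running()` the same way on both; only the docker answer reflects a real process.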

Note about the agent itself

This is something I struggled with, but I came to the following realization:

The agent itself (i.e. the LLM or harness that actually does things) is only a subprocess, so it does not really have a lifecycle. It is working or it isn't.

I did consider a substate for the started state, but that isn't the environment's concern.
There is a lot to say about the agent itself, though, and it may seem like I'm ignoring it, but it will become a central topic later on. I am setting up everything around it first.

Note also that I am not trying to replace existing harnesses. opencode, claude code, etc. all work pretty well, and it would be hard to make something even on par with them. Some already support remote control, sub-agents, etc.

The point is to make a library that makes it easy to orchestrate agents, is harness-agnostic, and even allows custom endpoints and running local models (a problem for which I already have a draft). To the library, all of these are the same as running claude code: an agent that you can talk to, make do things, and have communicate with other agents.

The next post is about skills. They've become pretty universal, so I want to support them, but I don't like the current, very liberal approach, which I think carries real security risks. Follow me if you want to know when it's up.

submitted by /u/facu_75