r/LocalLLaMA

llama-server router: a model pinned to one GPU still grabs a CUDA context on every card, so it OOMs when my others are full. Am I missing a flag or is this just how it is?


Running into something annoying with llama-server in router mode (`--models-preset`) and I can't tell if I'm missing a flag or if this is just how it works.

My rig is 2x 3090, 2x 4060 Ti (one's unplugged at the moment, riser got repurposed) and a 5060 Ti. I run a single llama-server router that spawns a child per model on demand, which is great. I usually have a few going at once: a 27B at Q8 across both 3090s for coding and my assistant, a little Gemma 4B on the 5060 Ti doing memory/fact-extraction for the assistant, and a nomic embedder on the same card.

Problem is, every child grabs a CUDA context on all the cards even when the model only lives on one. The Gemma is pinned to the 5060 Ti (`device = CUDA3`, `-ngl 99`) and sure enough it still parks ~256 MiB on each 3090 and ~120 MiB on the 4060 Ti, on top of its actual weights on the 5060 Ti.

Normally who cares. But the coding model takes the full 262K context split across both 3090s, which eats them down to ~200 MiB free each. As soon as that's loaded, asking for the memory model dies about 0.2s into the load with `CUDA error: out of memory`.

The 5060 has 15 GB free. It's not the target card that's the problem, it's that the child can't even create its context stub on the maxed 3090s, so the whole load aborts.
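For anyone who wants to check the same thing on their own rig, here's a rough pre-flight sketch: it reads free memory per card from `nvidia-smi` and flags any card that couldn't fit one more context. The 256 MiB threshold is just the per-card overhead I observed on the 3090s, not a documented llama.cpp constant, and the helper names (`parse_free`, `blocked_cards`, `query_free`) are made up for this example. Also keep in mind that `nvidia-smi` indices follow PCI bus order, which may not match the `CUDA0`..`CUDA3` numbering unless `CUDA_DEVICE_ORDER=PCI_BUS_ID` is set.

```python
import subprocess

# Per-card context overhead observed in this post (~256 MiB on each 3090).
# Assumption: used as a rough threshold, not an official figure.
STUB_MIB = 256

QUERY = ["nvidia-smi", "--query-gpu=index,name,memory.free",
         "--format=csv,noheader,nounits"]

def parse_free(csv_text):
    """Parse 'index, name, free_mib' lines into (index, name, free_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, name, free = [part.strip() for part in line.split(",")]
        rows.append((int(index), name, int(free)))
    return rows

def blocked_cards(rows, stub_mib=STUB_MIB):
    """Return the cards with too little free memory for one more context."""
    return [(i, name, free) for i, name, free in rows if free < stub_mib]

def query_free():
    """Run nvidia-smi and return its CSV output (needs the NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
```

Calling `blocked_cards(parse_free(query_free()))` right before requesting a model should list the cards that would make the child's load abort, even when the target card has plenty of room.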

I went poking in `server-models.cpp` and it looks like every child just inherits the router's env (`child_env = base_env`), so there's no per-model `CUDA_VISIBLE_DEVICES` I can set in the preset. And `--device` only seems to decide where the layers go, not which cards get a context. ggml inits all of them regardless.

I know I can run a second llama-server with `CUDA_VISIBLE_DEVICES` locked to the 5060 and call it a day, but that permanently walls off the card, and sometimes I want to dump everything and load one giant model across all the cards + RAM. A fixed split kills that.
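For reference, the workaround I mean looks roughly like this: a minimal sketch that starts a separate `llama-server` with `CUDA_VISIBLE_DEVICES` restricted to one card, so the process can't create a context anywhere else. The function names, model path and port are placeholders for illustration, and I'm assuming index 3 here is the same card llama.cpp calls `CUDA3` (which depends on device ordering). Inside the masked process the card gets renumbered to device 0.

```python
import os
import subprocess

def pinned_env(visible, base_env=None):
    """Copy the environment and restrict CUDA to the given device indices."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in visible)
    return env

def server_command(model_path, port):
    """Build a llama-server command line for one fully offloaded model."""
    return ["llama-server", "-m", model_path, "--port", str(port), "-ngl", "99"]

def launch_pinned(model_path, port, visible):
    """Start llama-server so it only sees the cards listed in `visible`."""
    return subprocess.Popen(server_command(model_path, port),
                            env=pinned_env(visible))

# Example (hypothetical path and port):
# proc = launch_pinned("/models/gemma-4b.gguf", 8081, [3])
```

It works, but it's exactly the fixed split I'm trying to avoid: the pinned card is invisible to the router until that second process is stopped.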

So is there a flag to make a child skip the GPUs it isn't using, or is the per-card context just expected behaviour? And for anyone running a bunch of models across cards who also occasionally needs the whole rig for one big model, how are you handling it?

submitted by /u/HockeyDadNinja