llama.cpp server: How do the -np and -c flags interact?
Mirrored from r/LocalLLaMA for archival readability.
I've been using LM Studio for a few months. I want to try Hermes agents with Qwen 3.6 MoE, so I'm switching to llama.cpp, and I don't fully understand how the server slots (-np) and the context size (-c) interact.
The total context -c appears to be split equally across the server slots, so each parallel client gets c / np tokens of context.
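A minimal sketch of the split described above, assuming an even integer division of -c across the -np slots (the exact rounding llama-server uses is an assumption here). `-m`, `-c` and `-np` are the llama-server flags for the model file, the total context size and the number of parallel slots; `model.gguf` and the value 65536 are hypothetical.

```python
# Minimal sketch, assuming the even split described above:
# each of the -np slots gets ctx_size // n_parallel tokens.

def per_slot_context(ctx_size: int, n_parallel: int) -> int:
    """Context (in tokens) each slot gets when -c is split across -np slots."""
    if n_parallel < 1:
        raise ValueError("n_parallel (-np) must be at least 1")
    return ctx_size // n_parallel  # integer division is an assumption


def server_cmd(model_path: str, ctx_size: int, n_parallel: int) -> list[str]:
    """Build a llama-server command line; model_path is a hypothetical file."""
    return ["llama-server", "-m", model_path,
            "-c", str(ctx_size), "-np", str(n_parallel)]


# Same total context, different slot counts: per-slot context shrinks as -np grows.
for n_parallel in (1, 2, 4):
    cmd = " ".join(server_cmd("model.gguf", 65536, n_parallel))
    print(f"{cmd}  ->  {per_slot_context(65536, n_parallel)} tokens per slot")
```

Raising -np without raising -c leaves each client a smaller window, so a long-running agent may hit its per-slot limit sooner.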
I have some questions:
- What happens if I launch the server with a context size -c larger than the model's maximum context?
- What if c / np is greater than the model's maximum context? Are there any downsides for model performance?
- If a rig can fit twice the model's maximum context in VRAM, is serving two agents in parallel twice as energy- and time-efficient as serving them sequentially?
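The three questions can be framed as arithmetic on the same split. This hypothetical helper compares -c and c / np against the model's trained context length, again assuming an even split. `N_CTX_TRAIN = 32768` is a made-up value for illustration, not Qwen's real limit. The sketch only shows which limit each setting crosses; it says nothing about what llama.cpp does in that case or how fast it runs.

```python
from dataclasses import dataclass


@dataclass
class SlotCheck:
    per_slot: int                 # tokens of context each slot gets (c // np)
    total_exceeds_train: bool     # question 1: -c larger than the model's max context
    per_slot_exceeds_train: bool  # question 2: c / np larger than the model's max context


def check_config(ctx_size: int, n_parallel: int, n_ctx_train: int) -> SlotCheck:
    """Compare a -c / -np setting against the model's trained context length.

    n_ctx_train is a hypothetical input; read the real value from the model's metadata.
    """
    if n_parallel < 1:
        raise ValueError("n_parallel (-np) must be at least 1")
    per_slot = ctx_size // n_parallel  # assumes the even split described in the post
    return SlotCheck(
        per_slot=per_slot,
        total_exceeds_train=ctx_size > n_ctx_train,
        per_slot_exceeds_train=per_slot > n_ctx_train,
    )


N_CTX_TRAIN = 32768  # hypothetical model max context, for illustration only

# Question 3's setup: allocate twice the max context and split it over two slots.
print(check_config(2 * N_CTX_TRAIN, 2, N_CTX_TRAIN))
# Same total context with a single slot: that slot now exceeds the max context.
print(check_config(2 * N_CTX_TRAIN, 1, N_CTX_TRAIN))
```

With -c set to twice the maximum and -np 2, the total exceeds the model's limit but each slot gets exactly the maximum; with -np 1, the single slot exceeds it as well.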