r/LocalLLaMA · May 30, 2026 · 2 min read

anybody got llama-swap working answering concurrent requests for a single model?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

EDIT: Solved. Works after the update, thanks everyone.

been trying this out for a bit, I have qwen 3.6 35b a3b running via this config:

qwen-36-35b-a3b: aliases: - qwen-a3b cmd: | env __GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 DRI_PRIME=1 \ llama-server \ -m "${baseModelDir}/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" \ --mmproj "${baseModelDir}/a3b-mmproj-BF16.gguf" \ --host 0.0.0.0 \ --port "${PORT}" \ -c 262144 \ -sm row \ -ngl 99 \ -ctk q8_0 \ -ctv q8_0 \ -mg 0 \ -np 2 \ -fa on \ --spec-type draft-mtp --spec-draft-n-max 2 \ --chat-template-kwargs '{"preserve_thinking": true}' \ --presence-penalty 0.0 \ --repeat-penalty 1.1 \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --min-p 0.00

I understand sm row + ngl makes it distribute to both GPUs, and np 2 makes it so I can have concurrent calls, and it works just fine when I run the command myself, I can open llama-server's GUI and execute 2 concurrent calls, BUT when running via llama-swap the second request will always wait until the first request resolves.

There is a configuration parameter for concurrency on llama-swap but it defaults to 10 (defaults to 0 but internally resolves to 10 *1), so that's also not it, perplexity didn't find any way either, couldn't find much on the issue tracker... Most concurrency things I find is for running different models, using the matrix and such, which is not what I want, don't want to run 2 llamacpp instances, I think running a single one here should be the optimal solution as I understand would use less GPU memory.

Anyone got something like this running?

# concurrencyLimit: overrides the allowed number of active parallel requests to a model # - optional, default: 0 # - useful for limiting the number of active parallel requests a model can process # - must be set per model # - any number greater than 0 will override the internal default value of 10 # - any requests that exceeds the limit will receive an HTTP 429 Too Many Requests response # - recommended to be omitted and the default used concurrencyLimit: 0

EDIT: Solved. Works after the update, thanks everyone.

submitted by /u/sickmartian
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA