anybody got llama-swap working answering concurrent requests for a single model?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
EDIT: Solved. Works after the update, thanks everyone.
been trying this out for a bit, I have qwen 3.6 35b a3b running via this config:
qwen-36-35b-a3b: aliases: - qwen-a3b cmd: | env __GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 DRI_PRIME=1 \ llama-server \ -m "${baseModelDir}/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf" \ --mmproj "${baseModelDir}/a3b-mmproj-BF16.gguf" \ --host 0.0.0.0 \ --port "${PORT}" \ -c 262144 \ -sm row \ -ngl 99 \ -ctk q8_0 \ -ctv q8_0 \ -mg 0 \ -np 2 \ -fa on \ --spec-type draft-mtp --spec-draft-n-max 2 \ --chat-template-kwargs '{"preserve_thinking": true}' \ --presence-penalty 0.0 \ --repeat-penalty 1.1 \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --min-p 0.00 I understand sm row + ngl makes it distribute to both GPUs, and np 2 makes it so I can have concurrent calls, and it works just fine when I run the command myself, I can open llama-server's GUI and execute 2 concurrent calls, BUT when running via llama-swap the second request will always wait until the first request resolves.
There is a configuration parameter for concurrency on llama-swap but it defaults to 10 (defaults to 0 but internally resolves to 10 *1), so that's also not it, perplexity didn't find any way either, couldn't find much on the issue tracker... Most concurrency things I find is for running different models, using the matrix and such, which is not what I want, don't want to run 2 llamacpp instances, I think running a single one here should be the optimal solution as I understand would use less GPU memory.
Anyone got something like this running?
*1
# concurrencyLimit: overrides the allowed number of active parallel requests to a model # - optional, default: 0 # - useful for limiting the number of active parallel requests a model can process # - must be set per model # - any number greater than 0 will override the internal default value of 10 # - any requests that exceeds the limit will receive an HTTP 429 Too Many Requests response # - recommended to be omitted and the default used concurrencyLimit: 0 EDIT: Solved. Works after the update, thanks everyone.
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.