Is there a way to disable reasoning per request in llama.cpp's llama-server, while leaving it on by default?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Title. I've got a llama.cpp server running a model being accessed across a number of scripts, and some of them are easier for the model than others, and those easier ones are also latency dependent. Rather than host two different servers with different parameters, I'd rather just send something along with the prompt to disable it.
If I must host multiple servers, am I able to host two servers for the same model but only have the model loaded in memory once? VRAM limited, like most of you I'm sure.
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.