Mmproj support?

https://github.com/ggml-org/llama.cpp/tree/master/tools/server#using-multiple-models

```
models_directory
 │
 │ # single file
 ├─ llama-3.2-1b-Q4_K_M.gguf
 ├─ Qwen3-8B-Q4_K_M.gguf
 │
 │ # multimodal
 ├─ gemma-3-4b-it-Q8_0
 │ ├─ gemma-3-4b-it-Q8_0.gguf
 │ └─ mmproj-F16.gguf # file name must start with "mmproj"
```

mmproj is also automatically selected for cached models downloaded via the `-hf user/model` option.
Supported via presets.ini, where you can specify the mmproj (and other long and short arguments) per model.
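As a sketch, a per-model section in presets.ini might look like the following. The section name and the exact keys here are illustrative assumptions (the keys mirror llama-server's long option names, as in the `[DEFAULT]` example elsewhere in this thread), not confirmed syntax:

```ini
; hypothetical presets.ini entry -- key names assumed for illustration
[gemma-3-4b-it-Q8_0]
mmproj = /models/gemma-3-4b-it-Q8_0/mmproj-F16.gguf
ctx-size = 8192
```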
Awesome new feature! Can model selection be done on something other than the requested model name? For example, specify a ranking in presets.ini, and then the highest-ranked model that can satisfy the request becomes the default. Maybe one model is best for short context, another (or the same one with other settings) for when the context gets too long, and another when image input is required.
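For reference, with the current router the client chooses a model explicitly via the standard OpenAI-compatible `model` field in the request body; a minimal sketch (the model name, host, and port are assumptions and must match what your server reports via `GET /v1/models`):

```python
import json


def build_chat_payload(model: str, prompt: str) -> dict:
    """Body for POST /v1/chat/completions; the router uses the
    "model" field to decide which model serves the request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


# Example: target one of the models from the directory listing above.
# (Model name is an assumption; list available names via GET /v1/models.)
body = json.dumps(build_chat_payload("Qwen3-8B-Q4_K_M", "Hello"))
```

The same payload works with any OpenAI-compatible client library by setting its `model` parameter.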
This is a good addition, thank you.
What is the best way to get <think> </think> and the tokens in between? The OpenAI library removes them. I want to run llama-server in a console and talk to it using a Python library that does not strip the thinking tokens.

I checked llama-cpp-python, but it does not have that.
By default, in most setups llama-server keeps the reasoning content in a reasoning_content field of the response message; you can get it from there. Otherwise, use the --reasoning-format flag and pass the deepseek value to get pure <think> tokens.
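A small sketch of pulling the reasoning text out of an already-parsed chat-completion response; the field layout assumed here follows llama-server's OpenAI-style JSON output, where the thinking tokens land in `message["reasoning_content"]`:

```python
def split_reasoning(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completion response dict.

    Assumes llama-server's OpenAI-style shape, with the <think>...</think>
    text extracted into message["reasoning_content"]; returns "" for the
    reasoning when the model emitted none.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content", ""), message.get("content", "")


# Example with a response shaped like llama-server's output
resp = {"choices": [{"message": {
    "reasoning_content": "The user greets; respond politely.",
    "content": "Hello!"}}]}
reasoning, answer = split_reasoning(resp)
```

Any HTTP client works for fetching the response itself; the point is only that the reasoning is a sibling field of `content`, so nothing is lost if your client library ignores it.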
Now I can use llama.cpp all the time. A big thank you to the devs.
Is there currently a way to have a "default" model if the request doesn't specify one? It could be the currently loaded model or a specific model. (I just noticed one of my apps broke because it's used to llama-server not requiring a model name.)
This seems to work:

```ini
[DEFAULT]
port = 8080
n-gpu-layers = -1
device = 0
flash-attn = on
chat-template = jinja
models-max = 4
```
Does it unload the current model if VRAM is full, to allow swapping to a new model?
Fun ideas: add a personal avatar and a p2p social network, and also eMule-style p2p model storage.
Hey there! Just wanted to drop a quick note saying I'm really digging the new router mode in llama.cpp server. It's a game-changer for me, especially when I need to switch between different models. The auto-discovery of models and LRU eviction is pretty neat: no more manual updates or restarts needed. It's like having a dynamic model manager on the fly. And the request routing part? Brilliant! It makes my workflow with dmenu smoother. Check out my dmenu launcher script on the project's Gitea: https://gitea.com/gnusupport/LLM-Helpers/src/branch/main/bin/rcd-llm-dmenu-launcher.sh
It's a win for sure.
Thanks for the update! Does it now behave like Ollama?
Thank you so much for this, it's great!
I want to pin models to a specific GPU (I have multiple); is that possible?