Extension idea: llama-server with custom samplers
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Just an idea and a prototype (made by Qwen3.6-27B-UD-Q6_K_XL via OpenCode) for allowing users to add custom sampling logic to llama-server without having to maintain their own entire fork and without having to make a wrapper that reimplements everything llama-server can do.

Included is an example extension that detects and breaks one kind of loop that I've commonly seen with heavily quantized models, where they get stuck repeating the same 1-3 tokens.

Other ideas for sampling that aren't in llama.cpp include different sampling parameters during thinking, tool calling, and normal generation; toggling grammars based on context; non-GBNF grammars; guaranteeing that only real tables are referenced in a generated SQL query; redacting PII in the sampler itself; and other experimental general sampling approaches.

This was based on the latest master branch after MTP was merged; also works with speculative decoding.

Posted for votes here: https://github.com/ggml-org/llama.cpp/discussions/23028

Branch: https://github.com/dpmm99/llama.cpp/tree/master-with-sampling-extensions

The example sampler extension is one fairly short file: https://github.com/dpmm99/llama.cpp/blob/master-with-sampling-extensions/examples/sampling-ext/loop-detector.cpp

Vulkan Windows x64 release copy for convenience if you want to try it: https://github.com/dpmm99/llama.cpp/releases/tag/dpmm99-0.1 but here's your daily reminder not to trust random executables from the internet. ;)

Example command: the extension working in llama-server with Qwen3.6-27B using MTP
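For anyone wondering what "detects and breaks a 1-3 token loop" could look like in practice, here is a minimal standalone sketch. It is not the code from loop-detector.cpp and it does not use any llama.cpp API; the function names (`detect_tail_loop`, `break_loop`), the parameters (`max_period`, `min_repeats`), and the choice to break the loop by banning the next token in the cycle are all assumptions made for illustration. It only works on a plain vector of token IDs and a plain vector of logits.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: returns the period (1..max_period) of a short cycle at
// the tail of `tokens`, or 0 if no cycle is found. A cycle of period p is
// reported when the last p * min_repeats tokens are the same p-token pattern
// repeated min_repeats times.
std::size_t detect_tail_loop(const std::vector<int>& tokens,
                             std::size_t max_period = 3,
                             std::size_t min_repeats = 8) {
    if (min_repeats < 2) return 0;  // a single occurrence is not a loop
    const std::size_t n = tokens.size();
    for (std::size_t p = 1; p <= max_period; ++p) {
        const std::size_t span = p * min_repeats;
        if (span > n) break;  // longer periods would need even more history
        bool periodic = true;
        // Every token in the tail window must equal the token p positions back.
        for (std::size_t i = n - span + p; i < n; ++i) {
            if (tokens[i] != tokens[i - p]) { periodic = false; break; }
        }
        if (periodic) return p;
    }
    return 0;
}

// Hypothetical helper: if the history ends in a short loop, set the logit of
// the token that would continue the loop to -infinity so it cannot be sampled.
// Returns true if a loop was detected. This breaking strategy is an assumption.
bool break_loop(const std::vector<int>& history, std::vector<float>& logits,
                std::size_t max_period = 3, std::size_t min_repeats = 8) {
    const std::size_t p = detect_tail_loop(history, max_period, min_repeats);
    if (p == 0) return false;
    const int next = history[history.size() - p];  // token that repeats the cycle
    if (next >= 0 && static_cast<std::size_t>(next) < logits.size()) {
        logits[static_cast<std::size_t>(next)] =
            -std::numeric_limits<float>::infinity();
    }
    return true;
}
```

A real extension would call something like `break_loop` once per generated token, passing the recent token history and the candidate logits before the normal sampler chain runs. The `min_repeats` threshold is the main knob: set it too low and legitimate repetition (lists, code indentation, table rows) could get interrupted.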