Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
We all know the struggle of optimizing your VRAM usage: quantized model, quantized kvcache, mmproj off.
I'm often frustrated by the tradeoffs I have to make in these areas. On my RTX 5090, I can fit:
- Qwen3.5-27B @ Q6_K
- Mmproj enabled, MTP off
- q8_0 kvcache
- 150k context
That brings me to 29/32 GB. I could probably optimize a little more to make full use of the remaining space, but it's frustrating finding just the right balance of parameters. Most of the time, I don't need my mmproj, nor do I want my kvcache quantized. Without an mmproj and without quantizing my kvcache, I could probably get 120k+ tokens of context, ballpark. Without an mmproj, I could turn on MTP.
80% of the time, this configuration would be strictly better:
- Qwen3.5-27B @ Q6_K
- Mmproj disabled, MTP on
- f16 kvcache
- ~120k context
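To make the tradeoff concrete, here is a rough back-of-the-envelope sketch of how kvcache size scales with context length and cache type. The per-element sizes follow the ggml block layouts (q8_0 stores 32 int8 values plus one f16 scale, q4_0 stores 32 4-bit values plus one f16 scale). The model dimensions are hypothetical placeholders, not the real Qwen3.5-27B configuration, and the formula assumes every layer is a plain full-attention layer, so models that mix in other layer types will come out differently.

```python
# Bytes needed to store one cache element, per ggml block layout:
#   f16  : 2 bytes per value
#   q8_0 : 32 int8 values + one f16 scale  = 34 bytes per 32 values
#   q4_0 : 32 4-bit values + one f16 scale = 18 bytes per 32 values
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, ctk="f16", ctv="f16"):
    """Estimate KV cache size in bytes for a plain full-attention transformer."""
    per_token_elements = n_layers * n_kv_heads * head_dim
    k_bytes = per_token_elements * BYTES_PER_ELEMENT[ctk]
    v_bytes = per_token_elements * BYTES_PER_ELEMENT[ctv]
    return n_ctx * (k_bytes + v_bytes)


# Hypothetical dimensions for illustration only (NOT the real model config).
dims = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(120_000, ctk=cache_type, ctv=cache_type, **dims)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

The useful takeaway is the ratio rather than the absolute numbers: under these assumptions q8_0 needs a bit more than half the memory of f16 for the same context, which is why quantizing the cache buys so much extra context.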
But sometimes I need an mmproj, and sometimes I'm working with big contexts and need a quantized kvcache. Changing any of this requires several seconds to fully unload+reload the entire model. If I do this mid-session, it takes even more time because I have to reprocess the entire context.
My inference harness has a swap system built in and I've squeezed as much latency as I can out of that, but it's still far too slow. Waiting a dozen seconds mid-session while I swap configs is No Good, Because I'm Impatient. I want to have my cake and eat it too.
I wasn't sure if all of this was due to technical limitations, so I spent the last week learning about llama.cpp's kvcache, and I can now report that dynamic/on-demand kvcache quantization is fully possible! I've implemented a proof of concept here: https://github.com/ggml-org/llama.cpp/pull/24134
What this does: adds an HTTP endpoint, POST /requantize_kvcache, which accepts two parameters (ctk, ctv). When called, it:
- reads and deletes your current kvcache
- creates a new, empty kvcache at your desired quantization
- quantizes your previous kvcache and loads it into the new one
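From an inference harness, calling the endpoint might look roughly like the sketch below. The endpoint path and the two parameter names (ctk, ctv) come from the description above; everything else is an assumption for illustration: that the parameters are sent as a JSON body, that the server listens on localhost:8080, and that the response is plain text or JSON. Check the PR for the actual request format.

```python
import json
import urllib.request


def build_requantize_request(base_url, ctk, ctv):
    """Build (but do not send) a POST request for /requantize_kvcache.

    Assumption: the endpoint takes ctk/ctv as a JSON body. The PR may
    define a different encoding.
    """
    body = json.dumps({"ctk": ctk, "ctv": ctv}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/requantize_kvcache",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def requantize_kvcache(base_url, ctk, ctv, timeout=60):
    """Send the request and return (HTTP status, response body as text)."""
    req = build_requantize_request(base_url, ctk, ctv)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")


# Example (requires a running server built from the PR branch):
# status, text = requantize_kvcache("http://localhost:8080", "q8_0", "q8_0")
```

Splitting request construction from sending keeps the harness side easy to test without a live server.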
Effectively, if your inference harness supports this, you can run most of your session with a full-precision kvcache and selectively quantize it when nearing memory limits. Requantizing takes significantly less time than unloading+reloading the entire model, with the added bonus that you don't need to reprocess the entire prompt. You can just pick up where you left off, now with more memory to work with.
Right now, this only supports the kvcache for some model architectures (Qwen3, for example, is what I've been using to test). It's incomplete in other ways, too (see the PR for details), but it wouldn't be too much work to wrap up the implementation. I'm hoping to finish this in the next week or two, assuming this is something llama.cpp maintainers want 😅
Other related wishlist items:
- An endpoint to load/unload just the mmproj (or swap between mmproj and MTP)
- A CLI flag like --fit that enables dynamic kvcache quantization without needing to call an API endpoint from your inference harness. This would give you as much full-precision context as fits on your device, then quantize your kvcache automatically once you approach the memory limit.
- An endpoint to do prompt processing on demand (though, I think this is just calling completions with n_predict: 0? I need to look into this).
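Until something like the automatic flag exists, the same policy could be approximated on the harness side. The sketch below is purely hypothetical: the precision ladder, the 90% threshold, and the function name are my own placeholders, and how you measure memory usage is left to the harness. It only decides which cache type to step down to; the harness would then pass the result as ctk/ctv to POST /requantize_kvcache.

```python
# Precision ladder from highest to lowest. When memory gets tight, step
# down one rung. The order and the set of types are assumptions.
LADDER = ["f16", "q8_0", "q4_0"]


def next_cache_type(current, used_bytes, budget_bytes, threshold=0.9):
    """Return the cache type to requantize to, or None if no change is needed.

    current      : cache type in use right now (must be in LADDER)
    used_bytes   : memory currently in use, as measured by the harness
    budget_bytes : total memory the harness is willing to use
    threshold    : fraction of the budget at which to step down (hypothetical)
    """
    if used_bytes < threshold * budget_bytes:
        return None  # still comfortably within budget
    idx = LADDER.index(current)
    if idx + 1 >= len(LADDER):
        return None  # already at the lowest precision in the ladder
    return LADDER[idx + 1]


# Example: at 29 of 32 GB used with an f16 cache, step down to q8_0.
target = next_cache_type("f16", used_bytes=29 * 2**30, budget_bytes=32 * 2**30)
if target is not None:
    print(f"would requantize kvcache to {target}")
```

Stepping down one rung at a time keeps precision as high as possible for as long as possible, at the cost of potentially requantizing more than once in a long session.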