llama.cpp has a clever trick for speeding up KV cache decode
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
So, I use llama-server as my endpoint to run local models and connect them to Open-WebUI, Hermes, and OpenCode. But since llama.cpp's WebUI has been receiving a lot of updates, I took a look at its settings and noticed one in particular under developer options.
As far as I can tell from the setting's description (I haven't looked at the code yet), it basically just re-sends all of the tokens generated by the current response to the KV cache right away, rather than waiting for you to prompt the model again before decoding them. It's certainly a hacky workaround, but it seriously improves general responsiveness when a model turn generates a whole bunch of tokens or receives a large amount of info from a tool call. To enable it, just start your llama-server and turn the option on in the WebUI. Once it's on, it works across all requests that hit llama-server, not just the ones from the WebUI.
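To make the idea concrete, here is a toy sketch in Python of why pre-filling the cache helps. It follows the description above, which has not been checked against the llama.cpp code, so treat the mechanism as an assumption. The names `ToyPrefixCache` and `next_turn_cost` are hypothetical and are not part of llama.cpp. The sketch only counts how many tokens would still need processing when the next request arrives, assuming the server reuses the longest shared prefix between the cached tokens and the new prompt.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Pretend KV cache: remembers only the last token sequence it evaluated."""

    def __init__(self):
        self.cached = []

    def process(self, tokens):
        """Evaluate `tokens`, reusing the shared prefix.

        Returns how many tokens actually had to be computed.
        """
        reused = common_prefix_len(self.cached, tokens)
        self.cached = list(tokens)
        return len(tokens) - reused


def next_turn_cost(prompt, response, follow_up, warm_up):
    """Tokens left to process when the follow-up request arrives.

    Assumption (from the post's description): without the option, the cache
    holds only the original prompt; with it, the response is pushed into the
    cache ahead of time.
    """
    cache = ToyPrefixCache()
    cache.process(prompt)
    if warm_up:
        # Done in the background, before the user sends anything new.
        cache.process(prompt + response)
    return cache.process(prompt + response + follow_up)


if __name__ == "__main__":
    prompt = list(range(100))            # 100 prompt tokens
    response = list(range(1000, 1500))   # 500 generated / tool-call tokens
    follow_up = list(range(2000, 2020))  # 20 tokens in the next user message
    print("without warm-up:", next_turn_cost(prompt, response, follow_up, False))
    print("with warm-up:   ", next_turn_cost(prompt, response, follow_up, True))
```

In this toy model, the work for the long response moves out of the wait before the next reply, so only the new message remains to be processed. Real behaviour depends on how llama-server matches prefixes and on what the chat template does to earlier turns.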
In Open-WebUI, for example, I used to have to wait 5-30 seconds for prompt processing whenever Qwen would read an incredibly large webpage or something similar. That seems like nothing, but when your model is scraping multiple webpages in a single turn, it really adds up. Since enabling this option, however, it's almost instant.
I haven't noticed any real trade-offs yet, and I just thought this would be a good little PSA post to put out there. For those wondering, I'm running Qwen3.6-35B-A3B @ MXFP4, fully offloaded to a single RX 7900 XTX, getting about 100 tps with no MTP at the moment, as it's still not compatible with vision encoders. I imagine this would be even better for those of you using the new MTP patches, particularly the one that introduced MTP for PP.
I had no idea this feature existed, so I hope this helps somebody out! Like I said, it's hacky, but it certainly works!