r/LocalLLaMA · · 2 min read

llama.cpp has a clever trick for speeding up KV cache decode


So, I use llama-server as the endpoint for my local models and connect it to Open-WebUI, Hermes, and OpenCode. Since llama.cpp's WebUI has been receiving a lot of updates, I took a look at its settings and noticed one in particular under the developer options.

This is the setting. As far as I can tell from the description (I haven't looked at the code yet), it basically re-sends all of the tokens generated by the current response to the KV cache right away, rather than waiting for you to prompt the model again before decoding them. It's certainly a hacky workaround, but it seriously improves general responsiveness when a model turn generates a whole bunch of tokens or receives a large amount of info from a tool call. To enable it, just start llama-server and switch the option on in the WebUI. It applies across all requests that hit llama-server, not just those from the WebUI.
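To make the idea concrete, here is a toy model of prefix reuse in a KV cache. It is only a sketch based on the description above, not llama.cpp's actual code: the function names and token IDs are made up for illustration, and the assumption is simply that tokens already in the cache as a matching prefix don't need to be processed again.

```python
def common_prefix_len(cached, prompt):
    """Count how many leading tokens the cache and the new prompt share."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n


def tokens_to_process(cached, prompt):
    """Tokens that still need prompt processing, given what is already cached."""
    return len(prompt) - common_prefix_len(cached, prompt)


# Hypothetical token IDs standing in for a real conversation.
user_turn = list(range(0, 50))            # first user message
reply = list(range(1000, 1400))           # long reply or tool-call output
next_user_turn = list(range(2000, 2020))  # follow-up message

next_prompt = user_turn + reply + next_user_turn

# Cache holds only the original prompt: the whole reply is processed on the next turn.
cold = tokens_to_process(user_turn, next_prompt)

# Reply was pushed into the cache ahead of time: only the new message is left.
warm = tokens_to_process(user_turn + reply, next_prompt)

print(cold, warm)
```

In this toy setup, the work left for the next turn shrinks from the reply plus the new message to just the new message, which is the kind of difference that would show up as shorter prompt processing waits.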

In Open-WebUI, for example, I used to have to wait 5-30 seconds for prompt processing whenever Qwen would read an incredibly large webpage or something similar. That seems like nothing, but when your model is scraping multiple webpages in a single turn, it really adds up. Since enabling this option, it's almost instant.

I haven't noticed any real trade-offs yet, and I just thought this would make a good little PSA. For those wondering, I'm running Qwen3.6-35B-A3B @ MXFP4, fully offloaded to a single RX 7900 XTX, getting about 100 tps with no MTP atm, as it's still not compatible with vision encoders. I imagine this would be even better for those of you using the new MTP patches, particularly the one that introduced MTP for PP.

I had no idea this feature existed, so I hope this helps somebody out! Like I said, it's hacky, but it certainly works!

submitted by /u/ayylmaonade
