Maybe KV cache offload to RAM isn't bad

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

So, llama.cpp has the -nkvo (--no-kv-offload) option, which keeps the KV cache in system RAM instead of offloading it to VRAM. Many people avoid it because it obviously hurts performance.

But every option is a trade-off, and in my case I think it's worth it. Hear me out.

I'm running Qwen3.6 27B (IQ4_XS) on an RTX 5060 Ti 16GB with 32GB of DDR5. To fit 65k context, I have to quantize the KV cache down to q4_0 and keep only 58 layers on the GPU. This gives me 23 tps at peak, dropping to 16 tps during long generation.

llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 65000 \
  -ctk q4_0 -ctv q4_0 -fa on -ngl 58 -np 1 \
  --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
  --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
  --spec-type draft-mtp --spec-draft-n-max 2
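To get a feel for why the KV cache is the thing that has to give at 65k context, here is a rough size estimate in Python. It is only a sketch: the layer count, KV head count, and head dimension are hypothetical placeholders, not the real Qwen3.6 27B configuration, and it assumes every layer keeps a full attention KV cache (models with hybrid or sliding-window attention need less). The per-element sizes are the one part that isn't a guess: f16 is 2 bytes, and q4_0 stores 32 values in an 18-byte block.

```python
# Rough KV cache size estimate. All architecture numbers used below are
# hypothetical placeholders, NOT the real Qwen3.6 27B configuration.

# Bytes per cached element for the two cache types used in the post.
# f16 is 2 bytes; q4_0 packs 32 values into 18 bytes (16 bytes of 4-bit
# values plus one 2-byte scale), i.e. 0.5625 bytes per element.
BYTES_PER_ELEMENT = {"f16": 2.0, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Combined size of the K and V caches for a single sequence."""
    # Factor 2 covers K and V; assumes every layer caches full attention.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type]


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / 1024**3


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(n_layers=64, n_kv_heads=8, head_dim=128)
    for n_ctx in (65000, 131072):
        for cache_type in ("f16", "q4_0"):
            size = kv_cache_bytes(n_ctx, cache_type=cache_type, **shape)
            print(f"ctx={n_ctx:>6} {cache_type:>5}: {to_gib(size):6.2f} GiB")
```

Whatever the real model shape is, the ratio between the two cache types stays the same: f16 takes about 3.6x the space of q4_0, and doubling the context doubles the cache.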

With -nkvo added, I can fit the whole model on the GPU and keep the KV cache at its default f16. Speed dropped to 19 tps at peak and 14 tps during long generation. Not a bad trade-off.

llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 65000 \
  -fa on -ngl 99 -nkvo -np 1 \
  --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
  --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
  --spec-type draft-mtp --spec-draft-n-max 2
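To put a number on the trade-off, the small Python sketch below turns the tps figures quoted in this post (23 and 16 tps with the q4_0 cache in VRAM, 19 and 14 tps with -nkvo) into a relative slowdown and a per-token latency. The function names are just illustrative helpers, and the inputs are one machine's measurements, not a benchmark.

```python
def slowdown_pct(before_tps, after_tps):
    """Relative drop in tokens per second, as a percentage."""
    return (before_tps - after_tps) / before_tps * 100


def ms_per_token(tps):
    """Average time spent on one generated token, in milliseconds."""
    return 1000 / tps


if __name__ == "__main__":
    # (label, tps with q4_0 KV cache in VRAM, tps with -nkvo and f16 KV cache)
    measurements = [("peak", 23, 19), ("long generation", 16, 14)]
    for label, before, after in measurements:
        print(
            f"{label}: {slowdown_pct(before, after):.1f}% slower, "
            f"{ms_per_token(before):.1f} ms -> {ms_per_token(after):.1f} ms per token"
        )
```

That works out to roughly a 17% drop at peak and 12.5% during long generation, in exchange for an unquantized cache and all layers on the GPU.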

The interesting part is that I can even double the context window to 128k by keeping 63 out of 65 layers (for the MTP version) on the GPU. Generation speed didn't change much.

llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 131072 \
  -fa on -ngl 63 -nkvo -np 1 \
  --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
  --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
  --spec-type draft-mtp --spec-draft-n-max 2

Quantizing the KV cache while it is offloaded to RAM didn't seem to give any improvement, so we basically get f16 quality for free. In some cases, I found that quantizing it actually hurt performance.

So the takeaway is: if you find yourself quantizing the KV cache just to make the model fit, or you need a bigger context window, you might be better off offloading the KV cache to RAM instead.

submitted by /u/bobaburger