llama.cpp - how to free up even more space on your GPU
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
For the past week or two, llama.cpp has been working much better from a RAM usage perspective. I no longer see any memory leaks, and everything fits nicely on the GPU. My defaults are --n-gpu-layers 99 --no-mmap --mlock to avoid using regular RAM, since I run my 3090 in an eGPU setup. Current configuration: Qwen3.6-27B-UD-Q5_K_XL-mtp, q4_0 KV cache, 150k context.
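For anyone who wants to reproduce the setup, here is a minimal sketch that assembles those defaults into a llama-server command line. The model path is a placeholder, "150k" is assumed to mean 150000 tokens, and applying q4_0 to both K and V is my reading of the configuration above, not something the post spells out.

```python
import shlex


def build_llama_server_args(model_path, ctx_size, kv_type):
    """Assemble a llama-server argument list from the flags named in the post."""
    return [
        "llama-server",
        "--model", model_path,          # placeholder path, not the author's
        "--n-gpu-layers", "99",         # put every layer on the GPU
        "--no-mmap",                    # do not memory-map the model file
        "--mlock",                      # keep the model locked in memory
        "--ctx-size", str(ctx_size),    # assumed: 150k means 150000 tokens
        "--cache-type-k", kv_type,      # assumed: same type for K and V
        "--cache-type-v", kv_type,
    ]


args = build_llama_server_args("/models/example.gguf", 150_000, "q4_0")
print(shlex.join(args))
```

Keeping the flags in a list makes it easy to toggle one at a time and compare VRAM usage between runs.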
I wanted to create this thread to see if there are any additional tricks for freeing up even more memory so that I can further increase my context size.
My list of VRAM-related parameters for a given model (which is, of course, the biggest factor in memory footprint):
- --no-mmproj-offload: this is the biggest win. If you have a model with vision, you can keep the mmproj on the CPU instead of offloading it to the GPU. It costs a little performance, but you end up with about 1GB of additional free space on your card.
- --cache-type-k, --cache-type-v: KV cache quantization (obviously). It reduces memory allocation by roughly 50%, 75%, etc., but quality drops in return. My observation is that since attention rotation was introduced, I can even use q4 without much noticeable drop in quality. The memory I save lets me run a bigger base model, which helps me more than the KV cache quantization hurts.
- --cache-type-k-draft, --cache-type-v-draft: the same applies to the MTP model's KV cache.
- --spec-draft-n-max: guess up to x future tokens ahead in a single forward pass. With coding, I'm usually fine with "2" as the value. "1" consumes slightly less memory, but TPS drops about 5%. "3" doesn't make sense for my use case: it consumes more memory, but gives the same TPS as "1".
- --flash-attn on: this is the default value by now, as far as I know. Memory allocation would grow if you turned it off, but you cannot turn it off anyway if you use a quantized V cache.
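To put rough numbers on the KV cache bullet, here is a back-of-the-envelope sketch. The bytes-per-element values come from the ggml block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes). The model dimensions are hypothetical placeholders, not the real Qwen3.6 values, and the estimate only counts attention layers and ignores compute buffers.

```python
# Bytes per cached element. ggml quantizes in blocks of 32 values:
# q8_0 uses 34 bytes per block, q4_0 uses 18 bytes per block.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, type_k, type_v):
    """Estimate KV cache size: one K and one V tensor per attention layer."""
    elements = n_ctx * n_layers * n_kv_heads * head_dim  # per K (or per V)
    return elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v])


# Hypothetical dimensions for illustration only.
shape = dict(n_ctx=150_000, n_layers=16, n_kv_heads=8, head_dim=128)

base = kv_cache_bytes(**shape, type_k="f16", type_v="f16")
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(**shape, type_k=cache_type, type_v=cache_type)
    saved = 1 - size / base
    print(f"{cache_type}: {size / 2**30:.2f} GiB ({saved:.1%} saved vs f16)")
```

Because each block also stores a scale, the savings relative to f16 come out slightly below the round 50% and 75% figures: about 47% for q8_0 and about 72% for q4_0.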
Parameters I thought would help, until I realized they actually don't:
- --ctx-checkpoints: I've heard that decreasing this value would also decrease memory allocation, but that's not the case for me. The default is 64, and I see no change when I decrease it to a small value.
- --parallel: number of active user requests at a time. Since 1 is the default value, you cannot do anything with it in a single-user setup. However, if you increase it, the KV cache available to your main session is reduced accordingly (by 50%, 66%, etc.).
- --fit-target: sets a strict safety buffer margin (in megabytes, default 1024) that the engine must leave completely empty on your GPU (for example, reserved for video I/O). Since my monitor is plugged into a different card, I reduced it to 64, but it didn't help at all. As far as I know, llama.cpp now runs an internal calculation loop at startup to automatically adjust some variables and protect itself from an OOM crash.
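The --parallel percentages follow from simple division. A minimal sketch, assuming the total context is split evenly across slots as described above (actual behavior may differ depending on your build and options):

```python
def per_slot_context(n_ctx, n_parallel):
    """Context tokens available to each slot under an even split (assumption)."""
    return n_ctx // n_parallel


total_ctx = 150_000  # assumed token count for "150k"
for slots in (1, 2, 3):
    per_slot = per_slot_context(total_ctx, slots)
    reduction = 1 - per_slot / total_ctx
    print(f"--parallel {slots}: {per_slot} tokens per slot ({reduction:.0%} less)")
```

So with two slots your main session keeps half of the context, and with three slots only a third.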
I've shared my tips; what's one of yours? Is there anything else at all? Is your experience different from mine? Thanks!