r/LocalLLaMA

What's actually happening when a model spills out of VRAM into system memory?


As far as I understand it, llama.cpp can run a model across several kinds of compute (multiple GPUs, a multi-core CPU, CPU+GPU, etc.). What I don't understand is how that split actually happens, and I need to know that to tune my settings and flags properly.

For example, I'm running Unsloth's gemma4 26b Q5_K_XL for my personal project management/smart home agent. I have an RX 6600 XT, a Ryzen 7 5700X, and 32 GB of DDR4 at 3200 MHz. The model is about 21 GB and is absolutely spilling into system memory. My command is as follows:

./llama-server -m ~/llamacpp/models/gemma-4-26B-A4B-it-UD-Q5_K_XL.gguf --spec-type ngram-mod --spec-ngram-mod-n-match 24 --spec-draft-n-min 12 --spec-draft-n-max 48 -fa on --host 0.0.0.0 --port 8080 -fitc 40000 --reasoning-budget 3072 -t 8 -np 1 -fitt 192 

With this setup I'm getting around 20 tokens per second on decode and around 235 on prefill.

Some of the flags are copy-pasted straight from this sub, and the values are based mostly on vibes. I'm not very good at this, and I'm sure I'm doing plenty wrong, so criticism and suggestions are welcome. My agent prompt is well optimized for KV cache reuse, so I'm more focused on decode. I was using the atomic bot fork of llama.cpp for gemma MTP, but prefill was so bad (even with KV cache reuse) that I generally got a response sooner without it.

My specific question is: how is the CPU/GPU split handled?

The general idea that gets implied is that some of the model runs on the CPU and some on the GPU. But I've also read that the parts of the model being acted upon at any moment need to be in GPU memory, so pieces of the model are constantly being swapped from system memory to GPU memory. If that's true, the CPU isn't all that important, but PCIe bus speed and system memory speed are.

If instead it works the way it looks from the outside, where the parts of the model that live in system memory simply run on the CPU, then I should do classic CPU and memory overclocking to get as much compute performance and memory bandwidth as I can while staying reasonably stable.
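To put rough numbers on the two hypotheses, here is a back-of-envelope sketch in Python. It assumes decode is limited purely by how fast the spilled weights can be read, and it compares reading them over system RAM (CPU executes its share) against streaming them over PCIe (GPU executes everything). Everything not stated in the post is an assumption and is marked as such in the comments: the 8 GB of VRAM, the PCIe 4.0 x8 link, dual-channel memory, and an active fraction of 4/26 guessed from the "26B-A4B" in the filename. These are theoretical peaks that ignore KV cache, speculative decoding, and compute time, so treat the results as loose ceilings rather than predictions.

```python
# Back-of-envelope decode ceilings for the two hypotheses in the post.
# All bandwidths are theoretical peaks; real throughput will be lower.

GB = 1e9


def ddr_bandwidth(mt_per_s, channels=2, bus_bytes=8):
    """Peak DRAM bandwidth in bytes/s (assumes dual channel, 64-bit bus each)."""
    return mt_per_s * 1e6 * channels * bus_bytes


def decode_ceiling(bytes_per_token, bandwidth):
    """Upper bound on tokens/s if reading the weights is the only cost."""
    return bandwidth / bytes_per_token


model_bytes = 21 * GB      # from the post
vram_bytes = 8 * GB        # ASSUMPTION: 8 GB card, KV cache ignored
spilled_bytes = model_bytes - vram_bytes

# ASSUMPTION: only ~4B of 26B parameters are touched per token, and the
# spilled weights are touched in that same proportion. This is crude.
active_fraction = 4 / 26
bytes_per_token = spilled_bytes * active_fraction

ram_bw = ddr_bandwidth(3200)   # DDR4-3200 dual channel, 51.2 GB/s peak
pcie_bw = 15.75 * GB           # ASSUMPTION: PCIe 4.0 x8 link, peak

cpu_runs_its_share = decode_ceiling(bytes_per_token, ram_bw)
gpu_streams_over_pcie = decode_ceiling(bytes_per_token, pcie_bw)

print(f"spilled weights:        {spilled_bytes / GB:.1f} GB")
print(f"read per token (guess): {bytes_per_token / GB:.1f} GB")
print(f"ceiling, CPU executes:  {cpu_runs_its_share:.1f} tok/s")
print(f"ceiling, PCIe streams:  {gpu_streams_over_pcie:.1f} tok/s")
```

Under these assumptions the PCIe-streaming ceiling comes out well below the roughly 20 tokens per second reported above, while the system-RAM ceiling sits above it. That is only suggestive, since every input here is a guess, but swapping in measured values for the constants makes the comparison easy to redo.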

EDIT: I realized I should include my OS. Ubuntu 26.04, relatively unmodified. When the model is running on it, the system is set up to be effectively headless, so all of the GPU memory is available.

submitted by /u/Mrinohk