Seeking resources to read about llama.cpp server and how offloading works
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
SETUP INFO: AMD Radeon AI PRO R9700. Using llama-cpp server, ROCm docker version. Using the -ngl option to offload.
First of all, I'm greatly impressed by how llama-cpp server handles offloading. There's some fucking magic happening here, at least to me.
I have 32gb of VRAM so loading in the small models is no problem, but now I'm starting to experiment with models that spill into system RAM, testing tok/sec differences and various quants.
I'm currently testing Qwen3 Coder Next. At Q4_K_M, this thing weighs in at 45gb. I can make that one work, but the more of it spills into system RAM, the slower it is (obviously). So I am currently testing the smaller 4-bit quant, IQ4_XS at 36gb, trying to find the middle ground before quality starts to suffer.
If I offload 36 layers to the GPU, it fills 30 of my 32gb of VRAM. Tok/sec is around 25, which for an MoE model is not great at all - at least I don't think it is. I tried the 3-bit quant, which fits fully in VRAM, but after multiple quality issues, I gave up on it. I think for large models and coding, 3-bit is just too much compression, or at least it feels like it. (anyone else have this impression? or is it just me?)
Anyways - to my actual question - how the hell does llama-cpp do this magic? I am monitoring RAM usage and the swap file and neither is very high, yet only 30gb is loaded into VRAM, and that includes the 120k unquantized KV cache context, so a good chunk of this 36gb model has to be sitting somewhere else. It's basically impossible, so clearly I am missing something about how Kubuntu 24.04 manages system resources.
Is my KDE5 widget for RAM not capturing what llama-cpp is up to? I'd like to read up on how it works or if someone can explain it to my dumb ass, I'd greatly appreciate it lol.
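One possible explanation worth checking, offered as a guess rather than a diagnosis: llama.cpp memory-maps the model file by default (there is a --no-mmap flag to turn that off), and on Linux mapped file pages live in the page cache. The kernel reports that under "Cached" in /proc/meminfo, and a RAM widget that only shows "used" memory may leave it out. Whether the KDE widget does this is an assumption. The sketch below splits the numbers so the two can be compared; parse_meminfo and summarize are hypothetical helper names, not part of any tool.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def summarize(fields):
    """Convert the fields relevant to this question from kB to GiB."""
    kb_per_gib = 1024 * 1024
    return {
        "total_gib": fields.get("MemTotal", 0) / kb_per_gib,
        "available_gib": fields.get("MemAvailable", 0) / kb_per_gib,
        # Page cache: includes pages of memory-mapped files such as a model file.
        "page_cache_gib": fields.get("Cached", 0) / kb_per_gib,
        # Anonymous memory: ordinary heap/stack allocations made by processes.
        "anonymous_gib": fields.get("AnonPages", 0) / kb_per_gib,
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        for name, value in summarize(parse_meminfo(f.read())).items():
            print(f"{name}: {value:.1f}")
```

If page_cache_gib jumps by several GiB while the model is loaded and drops back after the server exits, the "missing" part of the model is most likely sitting in the page cache rather than in swap.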
EDIT: Offloading also has a nice bonus benefit of being QUIET. For anyone with a very loud GPU fan, it's a nice break. Yes it's slower but I can work on other tabs and windows while it processes and actually hear myself think. I might do more of this.