GPU VRAM only for small models with llama.cpp: is it possible?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I'm still learning, and so far I've been able to make satisfying use of my setup (4070 with 12GB VRAM + 32GB RAM, with the iGPU driving my GUI). I've been able to run both the Gemma4 26B and Qwen 3.6 35B MoEs at high quants with large context, and I get about 40 t/s with both.
However, I'd like to try a smaller model, ideally a quant of Qwen3.5-9B, running entirely in VRAM with no host memory involved to slow things down. In theory it should be possible, but even gemma4-e2b at a low quant (IQ4_XS) with a small context (8192) ends up using about 3.5 GB of system RAM on top of the GPU memory.
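As a rough sanity check on the "in theory" part, here is a minimal back-of-the-envelope sketch of the VRAM budget. It uses the usual KV-cache size formula for a plain transformer with grouped-query attention. The layer count, KV head count, head dimension, weight size, and overhead figure below are placeholder assumptions for illustration, not the real numbers for Qwen3.5-9B or gemma4-e2b, and the formula does not cover hybrid or sliding-window attention.

```python
GIB = 1024 ** 3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Approximate KV-cache size for standard attention.

    K and V each hold n_kv_heads * head_dim values per token per layer,
    hence the leading factor of 2. bytes_per_elem=2 assumes an f16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

def fits_in_vram(weights_gib, kv_gib, overhead_gib, vram_gib):
    """True if weights + KV cache + overhead fit in the given VRAM."""
    return weights_gib + kv_gib + overhead_gib <= vram_gib

# Hypothetical architecture values, NOT taken from any real model card.
kv = kv_cache_bytes(n_layers=36, n_kv_heads=8, head_dim=128, ctx_len=8192)
print(f"KV cache: {kv / GIB:.3f} GiB")

# Hypothetical 5.5 GiB quantized weights and 1 GiB of compute buffers
# and other overhead, on a 12 GiB card.
print("fits:", fits_in_vram(5.5, kv / GIB, 1.0, 12.0))
```

With these placeholder numbers the model would fit comfortably, which matches the expectation that a 9B quant at 8192 context should not need host memory for capacity reasons alone.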
I've tried all the command-line options I could find for llama-server, but so far... no cigar.
What am I doing wrong?