r/LocalLLaMA · · 1 min read

GPU VRAM only for small models with llama.cpp: is it possible?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I'm still in my learning process and so far I've been able to make satisfying use of my setup (4070 with 12GB VRAM + 32GB RAM and iGPU for my GUI). I've been able to run both Gemma4 26B and Qwen 3.6 35B MoEs up to high quants with large context and have about 40 t/s with both.

However, I'd like to try a smaller model, ideally a quant of Qwen3.5-9B, with full VRAM usage and no host memory to slow down things. In theory it should be possible, but even gemma4-e2b with a low quant (Q4_IXS) with small context (8192) ends up using about 3.5 GB of RAM on top of the GPU.

I've tried all the command line options I could find with llama-server, but so far...no cigar.

What am I doing wrong?

submitted by /u/Ps3Dave
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA