r/LocalLLaMA

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)


I got Qwen 3.6 35B-A3B and Gemma 4 26B-A4B running on a $200 secondhand machine (i7-6700 / GTX 1080 / 32 GB RAM) using llama.cpp. The TurboQuant/RotorQuant KV-cache quantisation is what makes a 128k context fit inside the 8 GB of VRAM.

Results (Q4_K_M models, 128k context):

| Model | tok/s | Key flags |
|---|---|---|
| Qwen 3.6 35B-A3B | ~24 | `--n-cpu-moe 30`, K=turbo4 V=turbo3 |
| Gemma 4 26B-A4B (no MTP) | ~20 | `--n-cpu-moe 20`, K=V=turbo3, `--flash-attn` |
| Gemma 4 26B-A4B + MTP (naive) | ~21 | embedding table silently on CPU |
| Gemma 4 26B-A4B + MTP (fixed) | ~24.5 | `--override-tensor-draft "token_embd\.weight=CUDA0"` |
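For reference, the Qwen launch can be sketched roughly like this. `--n-cpu-moe`, `--cache-type-k`/`-v` and `--flash-attn` are standard llama.cpp flags, but the `turbo4`/`turbo3` cache types are specific to the TurboQuant fork, and the model filename is a placeholder:

```shell
# Sketch of the Qwen run from the table above.  The turbo4/turbo3 cache
# types come from the TurboQuant fork; adjust the model path to your own.
./llama-server \
  -m models/Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -c 131072 \
  -ngl 99 \
  --n-cpu-moe 30 \
  --cache-type-k turbo4 \
  --cache-type-v turbo3 \
  --flash-attn
```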

The trick is MoE offloading: llama.cpp can park the cold expert weights in system RAM and stream them over PCIe to the GPU on demand, while keeping the hot layers and the KV cache resident on the GPU. The whole system is PCIe-bandwidth-limited: the GPU sits at ~40-50% utilisation while the PCIe 3.0 x16 link is maxed out.
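As a sanity check on the "PCIe-limited" claim, here is a back-of-envelope estimate. Every constant in it is an illustrative assumption (effective link bandwidth, Q4_K_M bits per weight, and the fraction of active weights that actually crosses the bus per token), not a measurement from my rig:

```python
# Rough decode-speed ceiling when cold MoE experts stream over PCIe.
# All constants are illustrative assumptions, not measurements.
PCIE3_X16_GBPS = 12.0        # ~12 GB/s achievable on PCIe 3.0 x16
ACTIVE_PARAMS = 3e9          # "A3B" ~= 3B active parameters per token
BYTES_PER_PARAM = 4.5 / 8    # Q4_K_M averages roughly 4.5 bits/weight
STREAMED_FRACTION = 0.3      # assume ~30% of active weights cross the bus

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM * STREAMED_FRACTION
tok_per_s = PCIE3_X16_GBPS * 1e9 / bytes_per_token
print(f"~{tok_per_s:.0f} tok/s ceiling")
```

Under those assumptions the ceiling lands around 24 tok/s, i.e. in the right ballpark of the measured numbers.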

Biggest finding: Gemma 4's MTP speculative decoding barely helps out of the box (~5% gain). It turns out llama.cpp unconditionally keeps the token-embedding table on the CPU. Normally that's fine (it's just a get_rows lookup), but Gemma 4's MTP draft head ties its LM head to the embeddings, so every draft token triggers a full 262k×1024 matmul across PCIe. Forcing the table onto the GPU with `--override-tensor-draft` gives the real speedup: ~22%, with a ~79% draft acceptance rate.
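To see why the naive MTP path is so slow, consider the cost of touching that CPU-resident table as a tied LM head. A quick estimate, assuming f16 weights and ~12 GB/s effective PCIe 3.0 x16 bandwidth (both assumptions for illustration):

```python
# Cost of using a CPU-resident 262144 x 1024 embedding table as a tied
# LM head for every draft token.  f16 weights and the effective PCIe
# bandwidth are assumptions for illustration.
VOCAB, HIDDEN = 262144, 1024
BYTES_F16 = 2
PCIE_GBPS = 12.0

table_bytes = VOCAB * HIDDEN * BYTES_F16              # ~0.54 GB
ms_per_draft_token = table_bytes / (PCIE_GBPS * 1e9) * 1e3
print(f"{ms_per_draft_token:.1f} ms per draft token")
```

At roughly 45 ms per draft token, drafting alone would cap throughput near 22 tok/s, which matches the "barely helps" observation; a get_rows lookup, by contrast, only moves a handful of rows.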

Setup pain points (Fedora 42 + Pascal GPU):

  • Pin akmod-nvidia to the 580xx branch (Pascal is going legacy)
  • Force gcc-14 for CUDA 12.9 (newer gcc versions are rejected by nvcc)
  • Patch CUDA's math_functions.h for glibc 2.41 compatibility
  • Used the AtomicBot-ai/atomic-llama-cpp-turboquant fork for both the TurboQuant cache and Gemma MTP support
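The build itself can be sketched roughly as follows. The repo name is from the bullet above; the gcc-14 binary path, clone location, and exact flags are my assumptions (`-DCMAKE_CUDA_ARCHITECTURES=61` targets Pascal, and forcing the CUDA host compiler is the standard workaround for a gcc/nvcc version mismatch):

```shell
# Rough build sketch for the fork on Fedora + Pascal.  Paths and the
# gcc-14 binary name are assumptions; check your distro's packaging.
git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
cd atomic-llama-cpp-turboquant
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=61 \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-14
cmake --build build --config Release -j
```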

The full blog post has all the grindy build details: every command, plus the debugging deep-dive into the MTP embedding-table issue.

I'm also planning a YouTube video walkthrough soon - I'll update when that's live.

Happy to answer questions about the setup.

submitted by /u/mdda
