r/LocalLLaMA

Experimentation with Qwen 3.6 and Gemma 4 - Guidance needed

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I’m a web developer doing mostly coding, but also project management, requirements analysis, testing, etc. I recently started experimenting with local LLMs, mostly because agentic stuff finally made them feel useful. Note: this text was run through ChatGPT to fix my messy, repetitive grammar.

My initial impression was honestly pretty discouraging. Endless model option confusion, benchmarks that are hard to translate, huge VRAM requirements and hardware prices that are completely unreasonable.

Still, it feels like things have started shifting. MoE models, smarter quantization, speculative decoding, QAT releases, MTP, etc. The ecosystem finally feels like it’s targeting more reasonable setups instead of just brute-forcing huge models into gigantic VRAM.

Before committing to expensive hardware, I thought I'd test with what I had on hand: a small rig with an i5-12400, 64GB DDR4 and 2x GTX 1050 Ti 4GB.

Honestly, I expected it to be unusable. Surprisingly, it has been viable.

With the Gemma 4 and Qwen 3.6 MoE models I’m getting roughly:

  • ~40 t/s prompt processing
  • ~12-18 t/s token generation depending on model/config

Prompt processing is probably the weakest point, especially since opencode passes its tools etc. in large prompts. But generation speed already feels real-time enough for productivity if I keep things focused.

Current observations:

  • Speed was fairly similar between the MoE versions of Qwen 3.6 and Gemma 4
  • I don't care for large automated workflows
  • Most of the time I ask for specific, simple tasks: review this file, write test cases for this file, translate this file, and so on. Context hovers around 16-32K most of the time. I don't expect the model to do my work automatically on huge projects.
  • Qwen MTP pushed it to ~15 t/s generation
  • Gemma feels better linguistically
  • The new Gemma QAT release, plus some more tuning of options, pushed me to ~18 t/s even before MTP
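To put the figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It only divides token counts by the speeds quoted in this post; the 500-token reply length is a made-up value for illustration, and it assumes the whole prompt is processed from scratch (a server that reuses a cached prompt prefix would only pay for the new tokens).

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time needed to process `tokens` at a steady rate."""
    return tokens / tokens_per_second


PROMPT_TPS = 40.0        # ~40 t/s prompt processing, from the post
GEN_TPS = (12.0, 18.0)   # ~12-18 t/s generation, from the post
REPLY_TOKENS = 500       # hypothetical reply length, for illustration only

for prompt_tokens in (16_384, 32_768):
    prefill = seconds_for(prompt_tokens, PROMPT_TPS)
    slow = seconds_for(REPLY_TOKENS, GEN_TPS[0])
    fast = seconds_for(REPLY_TOKENS, GEN_TPS[1])
    print(f"{prompt_tokens:>6} prompt tokens: prefill {prefill / 60:.1f} min, "
          f"reply {fast:.0f}-{slow:.0f} s")
```

At these rates, a cold 16K prompt costs roughly 7 minutes of prefill and a cold 32K prompt roughly 14, while the reply itself takes well under a minute. That is why prompt processing, not generation, dominates the wait.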

Right now I’m testing:

unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL

on llama.cpp with:

  • 32k context
  • 6 CPU threads
  • split across both 1050 Ti cards
  • q8 KV cache

The hardest part has been balancing MoE experts between CPU and GPU memory while leaving enough VRAM for context and compute buffers. Using -fit on its own left GPU memory unbalanced, with big chunks unused. A single GPU is probably easier to optimize.
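For a feel of how much VRAM the context itself can eat, here is a minimal sketch of the usual KV cache size estimate. The layer count, KV head count and head dimension below are HYPOTHETICAL placeholders, not the real Gemma 4 values, and the formula assumes full attention on every layer; models with sliding-window layers need less. The q8_0 figure relies on the GGML q8_0 layout of 32 int8 values plus one fp16 scale per block.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_elem: float) -> float:
    """Estimate KV cache size: K and V each hold n_kv_heads * head_dim
    values per token per layer, hence the leading factor of 2."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


F16 = 2.0        # bytes per element for an f16 cache
Q8_0 = 34 / 32   # q8_0: 32 int8 values + one fp16 scale per 32-element block

# HYPOTHETICAL model shape, for illustration only (not Gemma 4's real config).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
N_CTX = 32_768   # 32k context, as in the post

for name, width in (("f16", F16), ("q8_0", Q8_0)):
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX, width)
    print(f"{name:>5}: {size / 2**30:.3f} GiB")
```

Whatever the real model shape is, the ratio holds: a q8_0 cache costs a little over half of an f16 one, which matters a lot when each card only has 4GB.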

Current arguments with CUDA enabled:

-hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL -t 6 -fa on -b 256 -ub 128 --n-cpu-moe 18 --split-mode layer --tensor-split 3,1 -rea off --repeat-penalty 1.0 --parallel 1 --jinja -fit on --top-p 0.95 --top-k 64 --temp 1.0 --no-mmproj --no-mmap --mlock --ctx-size 32768 -ctk q8_0 -ctv q8_0
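As a small aside on --tensor-split 3,1: llama.cpp reads the value as a comma-separated list of proportions, not absolute sizes. The sketch below just normalizes such a string to show the intended share per GPU; the helper name split_fractions is made up for illustration, and the exact layer-to-GPU assignment llama.cpp ends up with may differ slightly.

```python
def split_fractions(tensor_split: str) -> list[float]:
    """Normalize a '--tensor-split' style string such as '3,1'
    into per-GPU fractions that sum to 1."""
    weights = [float(w) for w in tensor_split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]


fractions = split_fractions("3,1")
for gpu, frac in enumerate(fractions):
    print(f"GPU {gpu}: {frac:.0%} of the offloaded model")
```

So 3,1 asks for about three quarters of the offloaded model on the first card and one quarter on the second.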

I also tested Vulkan, but performance dropped to ~13 t/s generation and I ran into some mmap/mlock weirdness.

I’d really appreciate input on:

  • settings that might improve prompt processing speed
  • any Agents.md tricks
  • whether I’m doing something obviously inefficient
  • whether upgrading to a bigger or smaller modern GPU is actually worth it for this use case
  • AMD vs NVIDIA specifically for llama.cpp with opencode in 2026

Locally, pricing is weird:

  • The second-hand market is laughable
  • RTX 5060 Ti 16GB starts around 700€ and not directly available
  • Radeon 9060 XT 16GB is available and around 450€

I don’t mind slightly lower performance, but I do mind fighting instabilities or incompatibilities.

Curious what people here would do in this situation.

Edit:

Fixed CPU threads to 6 (15 came from the script I used with my Ryzen 7 1700).

submitted by /u/j0hnp0s