r/LocalLLaMA · · 2 min read

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I’ve been doing lots of testing back and forth with this 7900xtx. All of my workloads were relying on qwen3.6 models, which are amazing fwiw, but I wanted some diversity in thought. Namely for Honcho workload tiers and differing cron jobs. Not every workload benefits from an agentic-tuned model, so I’ve been testing out Gemma 4 models more. They also dropped quantization-aware training versions of the Gemma 4 family, which reportedly maintain the fidelity of BF16 weights, but with Q4 weights.

I ran an A/B comparison between the two sets to see how they differ, and if there’s any significant difference. Smaller models with faster speeds at high fidelity? Who doesn’t love a free lunch!

Here’s a write-up with config versions/flags/etc. My agent didn’t grab actual tok/s measurements (of course right) but you get a rough idea with the general wall clock times.

Full writeup with data: https://kmarble.dev/posts/gemma-4-qat-benchmark-same-quality-faster-less-vram/

TL;DR by model:

• 12B QAT over Q8_0 — the standout swap. Cut total generation time from 323s to 176s (45% faster), throughput up 83%, saves 5.7GB VRAM. Quality identical across all prompts. On constraint-following, regular Q8_0 spent 124 seconds iterating drafts while QAT nailed it in 24.

• 26B QAT over UD-Q4 — lean yes. Consistent moderate gains (1.0x-1.38x speedup), saves 2GB VRAM. No quality degradation observed on any prompt type at temp=1.0.

• 31B QAT over Q4_K_M — worth it despite small VRAM savings. 1.3x-1.5x faster, actually produced 8% more total output. On creative continuation: regular generated 710 chars and stopped, QAT went to 1256.

• E4B — skip for now. Results confounded by bit-width difference (regular was q8_0, QAT is q4-level). Need same-precision comparison.

Tested on single AMD 7900 XTX/ROCm via llama-swap at temp=1.0 with no token cap. Full raw outputs (~170KB markdown) for anyone who wants to dig into the actual generations.

submitted by /u/IvGranite
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA