Gemma 4 QAT Q4_0 Bench on Strix Halo
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
These are Google's official Gemma 4 QAT Q4_0 GGUF models, served locally through llama.cpp Vulkan/RADV on a Strix Halo APU.
QAT means quantization-aware training. Instead of taking a normal model and quantizing it only after training, the model is trained or adapted while accounting for the lower-precision format it will run in. The goal is to make a small Q4 model keep more of the original model's behavior than a simple post-training quantization.
Host
System: AMD Ryzen AI Max+ 395 / Radeon 8060S, gfx1151
Memory: 128 GB unified LPDDR5X
GTT ceiling: 96 GiB class / large-GTT setup
IOMMU: enabled
OS: Linux Mint 22.3 / Ubuntu noble base
Kernel: 6.17.0-23-generic
Mesa / RADV: Mesa 25.2.8 / RADV
Backend: llama.cpp Vulkan/RADV, Atomic llama.cpp TurboQuant fork for Gemma 4 assistant-head MTP
ROCm: installed, but these rows are Vulkan/RADV inference rows
Models
Main model: google/gemma-4-26B-A4B-it-qat-q4_0-gguf
Main model file: gemma-4-26B_q4_0-it.gguf
Main model size on disk: 14,439,361,440 bytes / 13.45 GiB
Architecture: Gemma 4 MoE, roughly 26B total parameters with about 4B active per token (the "A4B" in the name)
Other QAT models tested:
| Model | File size |
|---|---|
| Gemma 4 12B QAT Q4_0 | 6,975,877,728 bytes / 6.50 GiB |
| Gemma 4 26B-A4B QAT Q4_0 | 14,439,361,440 bytes / 13.45 GiB |
| Gemma 4 31B QAT Q4_0 | 17,650,999,456 bytes / 16.44 GiB |
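The GiB figures are the byte counts divided by 2^30. A minimal Python sketch that reproduces the conversions in the table; the function name `bytes_to_gib` is illustrative and not part of any tool used here.

```python
GIB = 1024 ** 3  # bytes per GiB (binary unit)


def bytes_to_gib(size_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return size_bytes / GIB


# Byte counts copied from the table above.
sizes = {
    "Gemma 4 12B QAT Q4_0": 6_975_877_728,
    "Gemma 4 26B-A4B QAT Q4_0": 14_439_361_440,
    "Gemma 4 31B QAT Q4_0": 17_650_999_456,
}

for name, size in sizes.items():
    print(f"{name}: {bytes_to_gib(size):.2f} GiB")
```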
MTP Assistant Heads
The first QAT MTP (multi-token prediction) probes borrowed the normal non-QAT Gemma 4 assistant heads. Those loaded, but draft acceptance was weak. The better result came from taking the matching QAT assistant sources from Google and converting those assistant checkpoints to Atomic/llama.cpp-compatible GGUF heads.
Official QAT assistant sources:
google/gemma-4-12B-it-qat-q4_0-unquantized-assistant
google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant
google/gemma-4-31B-it-qat-q4_0-unquantized-assistant
Converted local assistant heads:
| Main model | QAT assistant head | Size |
|---|---|---|
| Gemma 4 12B QAT | gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf | 444 MiB |
| Gemma 4 26B-A4B QAT | gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf | 441 MiB |
| Gemma 4 31B QAT | gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf | 491 MiB |
Conversion note: the assistant GGUF needs the gemma4_assistant metadata shape that this Atomic llama.cpp build expects, including n_embd_backbone and target architecture metadata. A public 31B QAT assistant GGUF I tried used different metadata and did not load as-is. The 12B source repo uses a newer Gemma4UnifiedAssistantForCausalLM config name, so I converted it through a temporary config alias to the existing Gemma 4 assistant converter path. The source weights were not hand-edited.
Latest Measured Numbers
| Lane | Load to listening | Prefill | Decode | Normalized wall, 1150-in/2000-out | Two-slot aggregate | Notes |
|---|---|---|---|---|---|---|
| Gemma 4 26B-A4B QAT Q4_0, plain F16 KV | ~4 s | 1194.4 tok/s | 59.4 tok/s | 34.6 s | 90.9 tok/s | best plain row |
| Gemma 4 26B-A4B QAT Q4_0, QAT MTP + Q8 KV | ~18 s | 729.3 tok/s | 71.4 tok/s | 29.6 s | 62.5 tok/s | best overall QAT lane |
| Gemma 4 12B QAT Q4_0, QAT MTP + Q8 KV | ~10 s | 539.9 tok/s | 45.6 tok/s | 46.0 s | 43.5 tok/s | strong small-model MTP lane |
| Gemma 4 12B QAT Q4_0, plain F16 KV | ~4 s | 666.5 tok/s | 25.7 tok/s | 79.5 s | 47.6 tok/s | plain baseline |
| Gemma 4 31B QAT Q4_0, QAT MTP + F16 KV | ~20 s | 203.6 tok/s | 19.1 tok/s | 110.4 s | 18.9 tok/s | works, but less efficient than 26B-A4B |
| Gemma 4 31B QAT Q4_0, plain Q8 KV | ~8 s | 204.2 tok/s | 11.0 tok/s | 187.4 s | 20.0 tok/s | best plain 31B row |
The main result: the 26B-A4B QAT model is the useful lane. Plain Vulkan already gives about 59 tok/s decode with very strong prefill, and the QAT-matched MTP/Q8 path reaches about 71 tok/s single-stream with much better acceptance than the borrowed-head probe.
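The normalized wall column is consistent with a simple sum: the time to prefill 1150 prompt tokens plus the time to decode 2000 output tokens, with load time excluded. A minimal Python sketch under that assumption; the function name `normalized_wall` is illustrative, and the rates are copied from the table.

```python
def normalized_wall(prefill_tps: float, decode_tps: float,
                    n_in: int = 1150, n_out: int = 2000) -> float:
    """Seconds to prefill n_in tokens and then decode n_out tokens."""
    return n_in / prefill_tps + n_out / decode_tps


# lane -> (prefill tok/s, decode tok/s), copied from the table above.
rows = {
    "26B-A4B plain F16 KV": (1194.4, 59.4),
    "26B-A4B QAT MTP + Q8 KV": (729.3, 71.4),
    "12B QAT MTP + Q8 KV": (539.9, 45.6),
    "12B plain F16 KV": (666.5, 25.7),
    "31B QAT MTP + F16 KV": (203.6, 19.1),
    "31B plain Q8 KV": (204.2, 11.0),
}

for lane, (prefill, decode) in rows.items():
    print(f"{lane}: {normalized_wall(prefill, decode):.1f} s")
```

Because 2000 output tokens dominate the sum, the MTP lane wins on wall time for the 26B-A4B model even though its prefill rate is lower.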
Draft Acceptance
Current QAT-matched MTP rows:
| Model | MTP acceptance | Effective acceptance-adjusted decode |
|---|---|---|
| Gemma 4 12B QAT Q4_0 + QAT MTP head | 78.4% | 43.9 tok/s |
| Gemma 4 26B-A4B QAT Q4_0 + QAT MTP head | 91.8% | 71.4 tok/s |
| Gemma 4 31B QAT Q4_0 + QAT MTP head | 60.4% | 19.0 tok/s |
The 26B-A4B row is the standout. It keeps the fast decode lane and acceptance is now high enough that I would treat it as the real QAT MTP result, not just a speed probe.
31B is more of a tradeoff:
| 31B setting | Decode | MTP acceptance |
|---|---|---|
| DRAFT_BLOCK_SIZE=3 | 19.1 tok/s | ~60% |
| DRAFT_BLOCK_SIZE=2 | 16.5-17.1 tok/s | ~76% |
This build rejects DRAFT_BLOCK_SIZE=1; the allowed range starts at 2. DRAFT_P_MIN did not materially change the 31B acceptance in my short sweep.
For comparison, the earlier borrowed-head QAT MTP rows had clearly lower draft acceptance:
| Model | Borrowed-head acceptance | QAT-matched acceptance |
|---|---|---|
| Gemma 4 26B-A4B QAT Q4_0 + MTP | 56.9% | 91.8% |
| Gemma 4 31B QAT Q4_0 + MTP | 42.5% | 60.4% |
Context Against Previous Local Gemma Rows
| Model / lane | Quant / path | Prefill | Decode |
|---|---|---|---|
| Gemma 4 26B-A4B non-QAT | UD-Q6_K_XL, plain Vulkan | 1002.8 tok/s | 44.8 tok/s |
| Gemma 4 26B-A4B QAT | Q4_0, plain Vulkan | 1194.4 tok/s | 59.4 tok/s |
| Gemma 4 26B-A4B QAT | Q4_0 + QAT MTP/Q8 KV | 729.3 tok/s | 71.4 tok/s |
| Gemma 4 31B non-QAT | Q6 plain Vulkan | 151.3 tok/s | ~8.1 tok/s |
| Gemma 4 31B QAT | Q4_0 plain Vulkan | 204.2 tok/s | 11.0 tok/s |
| Gemma 4 31B QAT | Q4_0 + QAT MTP/F16 KV | 203.6 tok/s | 19.1 tok/s |
| Gemma 4 12B QAT | Q4_0 plain Vulkan | 666.5 tok/s | 25.7 tok/s |
| Gemma 4 12B QAT | Q4_0 + QAT MTP/Q8 KV | 539.9 tok/s | 45.6 tok/s |
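To read the QAT rows as speedups, divide each MTP decode rate by the plain decode rate for the same model. A short Python sketch using the numbers above; the dictionary layout and the `speedup` helper are illustrative. The plain and MTP rows do not always share a KV cache type, so the ratio is not a pure MTP effect.

```python
# model -> (plain decode tok/s, QAT MTP decode tok/s), copied from the table above.
decode_rates = {
    "Gemma 4 26B-A4B QAT": (59.4, 71.4),
    "Gemma 4 31B QAT": (11.0, 19.1),
    "Gemma 4 12B QAT": (25.7, 45.6),
}


def speedup(plain_tps: float, mtp_tps: float) -> float:
    """Ratio of MTP decode rate to plain decode rate."""
    return mtp_tps / plain_tps


for model, (plain, mtp) in decode_rates.items():
    print(f"{model}: {speedup(plain, mtp):.2f}x decode with MTP")
```

This gives about 1.20x for 26B-A4B, 1.74x for 31B, and 1.77x for 12B. The 26B-A4B model gains the least in relative terms because its plain decode rate is already high, yet it remains the fastest lane in absolute tok/s.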
Takeaway
On a 128 GB Strix Halo APU, Google's official Gemma 4 26B-A4B QAT Q4_0 GGUF is a very strong local lane: about 59 tok/s plain and about 71 tok/s with the QAT-matched MTP/Q8 setup.
The important update is that QAT-matched assistant heads matter. Borrowing the normal non-QAT assistant heads was useful for proving that MTP could load, but the matched QAT heads substantially improved acceptance, especially on the 26B-A4B row.
I have not verified these QAT MTP rows on stock upstream llama.cpp or vLLM locally. The measured claim here is the Atomic llama.cpp TurboQuant fork on Vulkan/RADV.