Gemma 4 QAT Q4_0 Bench on Strix Halo

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

These are Google's official Gemma 4 QAT Q4_0 GGUF models, served locally through llama.cpp Vulkan/RADV on a Strix Halo APU.

QAT means quantization-aware training. Instead of taking a normal model and quantizing it only after training, the model is trained or adapted while accounting for the lower-precision format it will run in. The goal is to make a small Q4 model keep more of the original model's behavior than a simple post-training quantization.
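To make the idea concrete, here is a minimal sketch of the rounding a block of weights goes through in a Q4_0-style format. QAT setups typically run the forward pass on weights that have been through a round trip like this, so training can compensate for the rounding error. The block size of 32, the offset of 8, and the function names are assumptions based on the commonly described ggml Q4_0 layout. This is not Google's training code.

```python
# Illustrative Q4_0-style round trip for one block of 32 weights.
# Assumption: one scale per 32-value block and 4-bit codes in 0..15 with an
# offset of 8, following the commonly described ggml Q4_0 layout.
# This is not Google's QAT training code.

BLOCK_SIZE = 32


def quantize_block(block):
    """Return (scale, codes) for one block of BLOCK_SIZE floats."""
    assert len(block) == BLOCK_SIZE
    # Take the value with the largest magnitude, keeping its sign.
    extreme = max(block, key=abs)
    scale = extreme / -8.0
    if scale == 0.0:
        # All-zero block: every code sits at the offset.
        return 0.0, [8] * BLOCK_SIZE
    # x / scale lies in [-8, 8], so the shifted value is non-negative and
    # int() acts as round-to-nearest; clamp the top end to 15.
    codes = [min(15, int(x / scale + 8.5)) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Map 4-bit codes back to floats."""
    return [(c - 8) * scale for c in codes]


def fake_quantize(block):
    """Quantize then dequantize: the values a Q4_0 model would actually use."""
    scale, codes = quantize_block(block)
    return dequantize_block(scale, codes)


# Demo on an evenly spaced block from -1.0 up to just under 1.0.
weights = [(i - 16) / 16 for i in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale}, max round-trip error={max_error}")
```

Post-training quantization applies this rounding once to finished weights. QAT lets the weights move during training so that the rounded values still behave well.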

Host

System: AMD Ryzen AI Max+ 395 / Radeon 8060S, gfx1151

Memory: 128 GB unified LPDDR5X

GTT ceiling: 96 GiB class / large-GTT setup

IOMMU: enabled

OS: Linux Mint 22.3 / Ubuntu noble base

Kernel: 6.17.0-23-generic

Mesa / RADV: Mesa 25.2.8 / RADV

Backend: llama.cpp Vulkan/RADV, Atomic llama.cpp TurboQuant fork for Gemma 4 assistant-head MTP

ROCm: installed, but these rows are Vulkan/RADV inference rows

Models

Main model: google/gemma-4-26B-A4B-it-qat-q4_0-gguf

Main model file: gemma-4-26B_q4_0-it.gguf

Main model size on disk: 14,439,361,440 bytes / 13.45 GiB

Architecture: Gemma 4 MoE, roughly 26B total parameters with about 4B active per token (the "A4B" in the name)

Other QAT models tested:

| Model | File size |
| --- | --- |
| Gemma 4 12B QAT Q4_0 | 6,975,877,728 bytes / 6.50 GiB |
| Gemma 4 26B-A4B QAT Q4_0 | 14,439,361,440 bytes / 13.45 GiB |
| Gemma 4 31B QAT Q4_0 | 17,650,999,456 bytes / 16.44 GiB |
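The GiB figures are binary gibibytes (1 GiB = 1024^3 bytes), not decimal gigabytes. A quick sketch that reproduces the conversion from the byte counts above. The dictionary keys and helper name are only illustrative:

```python
# Convert the reported on-disk byte counts to GiB (1 GiB = 1024**3 bytes).
GIB = 1024 ** 3

# Byte counts copied from the table in the post.
file_sizes_bytes = {
    "Gemma 4 12B QAT Q4_0": 6_975_877_728,
    "Gemma 4 26B-A4B QAT Q4_0": 14_439_361_440,
    "Gemma 4 31B QAT Q4_0": 17_650_999_456,
}


def bytes_to_gib(n_bytes):
    """Return the size in binary gibibytes."""
    return n_bytes / GIB


for name, size in file_sizes_bytes.items():
    print(f"{name}: {bytes_to_gib(size):.2f} GiB")
```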

MTP Assistant Heads

The first QAT MTP probes borrowed the normal non-QAT Gemma 4 assistant heads. Those loaded, but acceptance was weak. The better result came from using the matching QAT assistant sources from Google and converting those assistant checkpoints to Atomic/llama.cpp-compatible GGUF heads.

Official QAT assistant sources:

google/gemma-4-12B-it-qat-q4_0-unquantized-assistant
google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant
google/gemma-4-31B-it-qat-q4_0-unquantized-assistant

Converted local assistant heads:

| Main model | QAT assistant head | Size |
| --- | --- | --- |
| Gemma 4 12B QAT | gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf | 444 MiB |
| Gemma 4 26B-A4B QAT | gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf | 441 MiB |
| Gemma 4 31B QAT | gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf | 491 MiB |

Conversion note: the assistant GGUF needs the gemma4_assistant metadata shape that this Atomic llama.cpp build expects, including n_embd_backbone and target architecture metadata. A public 31B QAT assistant GGUF I tried used different metadata and did not load as-is. The 12B source repo uses a newer Gemma4UnifiedAssistantForCausalLM config name, so I converted it through a temporary config alias to the existing Gemma 4 assistant converter path. The source weights were not hand-edited.

Latest Measured Numbers

| Lane | Load to listening | Prefill | Decode | Normalized wall, 1150-in/2000-out | Two-slot aggregate | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma 4 26B-A4B QAT Q4_0, plain F16 KV | ~4 s | 1194.4 tok/s | 59.4 tok/s | 34.6 s | 90.9 tok/s | best plain row |
| Gemma 4 26B-A4B QAT Q4_0, QAT MTP + Q8 KV | ~18 s | 729.3 tok/s | 71.4 tok/s | 29.6 s | 62.5 tok/s | best overall QAT lane |
| Gemma 4 12B QAT Q4_0, QAT MTP + Q8 KV | ~10 s | 539.9 tok/s | 45.6 tok/s | 46.0 s | 43.5 tok/s | strong small-model MTP lane |
| Gemma 4 12B QAT Q4_0, plain F16 KV | ~4 s | 666.5 tok/s | 25.7 tok/s | 79.5 s | 47.6 tok/s | plain baseline |
| Gemma 4 31B QAT Q4_0, QAT MTP + F16 KV | ~20 s | 203.6 tok/s | 19.1 tok/s | 110.4 s | 18.9 tok/s | works, but less efficient than 26B-A4B |
| Gemma 4 31B QAT Q4_0, plain Q8 KV | ~8 s | 204.2 tok/s | 11.0 tok/s | 187.4 s | 20.0 tok/s | best plain 31B row |

The main result: the 26B-A4B QAT model is the useful lane. Plain Vulkan already gives about 59 tok/s decode with very strong prefill, and the QAT-matched MTP/Q8 path reaches about 71 tok/s single-stream with much better acceptance than the borrowed-head probe.
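The post does not spell out how the "normalized wall" column is derived. A plausible reading, and one that reproduces the reported values to within rounding, is prefill time plus decode time for a 1150-token prompt and a 2000-token reply. The sketch below assumes that formula. The function name and the lane labels are illustrative:

```python
# Assumed formula: wall = prompt_tokens / prefill_rate + output_tokens / decode_rate.
# This is inferred from the reported numbers, not stated in the post.

PROMPT_TOKENS = 1150
OUTPUT_TOKENS = 2000


def normalized_wall_seconds(prefill_tps, decode_tps,
                            n_in=PROMPT_TOKENS, n_out=OUTPUT_TOKENS):
    """Estimate wall time for one request from prefill and decode rates."""
    return n_in / prefill_tps + n_out / decode_tps


# (prefill tok/s, decode tok/s, reported wall seconds) from the table.
lanes = {
    "26B-A4B plain F16 KV": (1194.4, 59.4, 34.6),
    "26B-A4B QAT MTP + Q8 KV": (729.3, 71.4, 29.6),
    "12B QAT MTP + Q8 KV": (539.9, 45.6, 46.0),
    "12B plain F16 KV": (666.5, 25.7, 79.5),
    "31B QAT MTP + F16 KV": (203.6, 19.1, 110.4),
    "31B plain Q8 KV": (204.2, 11.0, 187.4),
}

for name, (prefill, decode, reported) in lanes.items():
    estimate = normalized_wall_seconds(prefill, decode)
    print(f"{name}: estimated {estimate:.1f} s, reported {reported} s")
```

Under this reading, decode dominates the wall time in every lane, which is why the MTP rows win on wall time even where their prefill is slower.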

Draft Acceptance

Current QAT-matched MTP rows:

| Model | MTP acceptance | Effective acceptance-adjusted decode |
| --- | --- | --- |
| Gemma 4 12B QAT Q4_0 + QAT MTP head | 78.4% | 43.9 tok/s |
| Gemma 4 26B-A4B QAT Q4_0 + QAT MTP head | 91.8% | 71.4 tok/s |
| Gemma 4 31B QAT Q4_0 + QAT MTP head | 60.4% | 19.0 tok/s |

The 26B-A4B row is the standout. It keeps the fast decode lane and acceptance is now high enough that I would treat it as the real QAT MTP result, not just a speed probe.

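31B is more of a tradeoff: the larger draft block decodes faster but is accepted less often.

| 31B setting | Decode | MTP acceptance |
| --- | --- | --- |
| DRAFT_BLOCK_SIZE=3 | 19.1 tok/s | ~60% |
| DRAFT_BLOCK_SIZE=2 | 16.5-17.1 tok/s | ~76% |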

DRAFT_BLOCK_SIZE=1 is not accepted by this build; the allowed range starts at 2. DRAFT_P_MIN did not materially change the 31B acceptance in my short sweep.

For comparison, the earlier borrowed-head QAT MTP rows showed much lower acceptance:

| Model | Borrowed-head acceptance | QAT-matched acceptance |
| --- | --- | --- |
| Gemma 4 26B-A4B QAT Q4_0 + MTP | 56.9% | 91.8% |
| Gemma 4 31B QAT Q4_0 + MTP | 42.5% | 60.4% |

Context Against Previous Local Gemma Rows

| Model / lane | Quant / path | Prefill | Decode |
| --- | --- | --- | --- |
| Gemma 4 26B-A4B non-QAT | UD-Q6_K_XL, plain Vulkan | 1002.8 tok/s | 44.8 tok/s |
| Gemma 4 26B-A4B QAT | Q4_0, plain Vulkan | 1194.4 tok/s | 59.4 tok/s |
| Gemma 4 26B-A4B QAT | Q4_0 + QAT MTP/Q8 KV | 729.3 tok/s | 71.4 tok/s |
| Gemma 4 31B non-QAT | Q6, plain Vulkan | 151.3 tok/s | ~8.1 tok/s |
| Gemma 4 31B QAT | Q4_0, plain Vulkan | 204.2 tok/s | 11.0 tok/s |
| Gemma 4 31B QAT | Q4_0 + QAT MTP/F16 KV | 203.6 tok/s | 19.1 tok/s |
| Gemma 4 12B QAT | Q4_0, plain Vulkan | 666.5 tok/s | 25.7 tok/s |
| Gemma 4 12B QAT | Q4_0 + QAT MTP/Q8 KV | 539.9 tok/s | 45.6 tok/s |
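For a quick read on how much the MTP lanes help each model, the decode columns above can be turned into speedup ratios. This is plain arithmetic on the reported numbers. Note that the plain and MTP rows also differ in KV cache type, so the ratio is not a pure measure of MTP alone. The variable names are illustrative:

```python
# Decode rates in tok/s copied from the table: (plain Vulkan, QAT MTP lane).
decode_rates = {
    "12B": (25.7, 45.6),
    "26B-A4B": (59.4, 71.4),
    "31B": (11.0, 19.1),
}

# Speedup = MTP decode rate / plain decode rate for the same QAT model.
speedups = {
    model: mtp / plain for model, (plain, mtp) in decode_rates.items()
}

for model, ratio in speedups.items():
    print(f"Gemma 4 {model} QAT Q4_0: MTP decode is {ratio:.2f}x plain")
```

On these numbers the relative gain is smaller on 26B-A4B than on the dense 12B and 31B rows, but 26B-A4B starts from a much faster plain decode and still ends up with the highest absolute rate.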

Takeaway

On a 128 GB Strix Halo APU, Google's official Gemma 4 26B-A4B QAT Q4_0 GGUF is a very strong local lane: about 59 tok/s plain and about 71 tok/s with the QAT-matched MTP/Q8 setup.

The important update is that QAT-matched assistant heads matter. Borrowing the normal non-QAT assistant heads was useful for proving that MTP could load, but the matched QAT heads substantially improved acceptance, especially on the 26B-A4B row.

I have not verified these QAT MTP rows on stock upstream llama.cpp or vLLM locally. The measured claim here is the Atomic llama.cpp TurboQuant fork on Vulkan/RADV.

submitted by /u/westsunset