r/LocalLLaMA · June 6, 2026 · 5 min read

Gemma 4 QAT Q4_0 Bench on Strix Halo

These are Google's official Gemma 4 QAT Q4_0 GGUF models, served locally through llama.cpp Vulkan/RADV on a Strix Halo APU.

QAT means quantization-aware training. Instead of taking a normal model and quantizing it only after training, the model is trained or adapted while accounting for the lower-precision format it will run in. The goal is to make a small Q4 model keep more of the original model's behavior than a simple post-training quantization.
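To make the idea concrete, here is a minimal Python sketch of the quantize-then-dequantize round trip that a QAT forward pass would typically simulate for one block of weights. It assumes a Q4_0-style layout (32 weights sharing one scale, 4-bit codes 0 to 15 centered on 8). The function name `fake_quant_q4_block` is hypothetical, and this illustrates the general technique only; it is not Google's training recipe or llama.cpp's actual kernel.

```python
def fake_quant_q4_block(block):
    """Quantize one block of floats to 4-bit codes, then dequantize.

    Illustrative Q4_0-style scheme: one shared scale per block,
    codes in 0..15 that represent the integer levels -8..7.
    """
    # Use the element with the largest magnitude (sign kept) to set the scale.
    anchor = max(block, key=abs)
    if anchor == 0.0:
        return [0.0] * len(block)
    scale = anchor / -8.0  # the anchor itself maps to level -8 (code 0)

    restored = []
    for x in block:
        code = int(x / scale + 8.5)    # shift to 0..16 and round to nearest
        code = max(0, min(15, code))   # clamp into the 4-bit range
        restored.append((code - 8) * scale)
    return restored


if __name__ == "__main__":
    # A deterministic 32-value block standing in for real weights.
    weights = [0.1 * i - 1.6 for i in range(32)]
    restored = fake_quant_q4_block(weights)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"distinct levels used: {len(set(restored))}")
    print(f"largest round-trip error: {worst:.4f}")
```

During QAT the loss is computed on the restored values, so the weights learn to sit where the 4-bit grid can represent them. Post-training quantization applies the same rounding only once, after training has finished.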

Host

System: AMD Ryzen AI Max+ 395 / Radeon 8060S, gfx1151

Memory: 128 GB unified LPDDR5X

GTT ceiling: 96 GiB class / large-GTT setup

IOMMU: enabled

OS: Linux Mint 22.3 / Ubuntu noble base

Kernel: 6.17.0-23-generic

Mesa / RADV: Mesa 25.2.8 / RADV

Backend: llama.cpp Vulkan/RADV, Atomic llama.cpp TurboQuant fork for Gemma 4 assistant-head MTP

ROCm: installed, but these rows are Vulkan/RADV inference rows

Models

Main model: google/gemma-4-26B-A4B-it-qat-q4_0-gguf

Main model file: gemma-4-26B_q4_0-it.gguf

Main model size on disk: 14,439,361,440 bytes / 13.45 GiB

Architecture: Gemma 4 MoE, roughly 26B total parameters with about 4B active per token (the "A4B" in the name)

QAT models tested (main model included for reference):

Model	File size
Gemma 4 12B QAT Q4_0	`6,975,877,728` bytes / `6.50 GiB`
Gemma 4 26B-A4B QAT Q4_0	`14,439,361,440` bytes / `13.45 GiB`
Gemma 4 31B QAT Q4_0	`17,650,999,456` bytes / `16.44 GiB`
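The GiB figures in the table follow from dividing the byte counts by 2^30. A small Python check, using only the byte counts listed above (the dictionary keys are just labels chosen for this sketch):

```python
GIB = 2 ** 30  # bytes in one gibibyte

# File sizes in bytes, copied from the table above.
FILE_SIZES = {
    "Gemma 4 12B QAT Q4_0": 6_975_877_728,
    "Gemma 4 26B-A4B QAT Q4_0": 14_439_361_440,
    "Gemma 4 31B QAT Q4_0": 17_650_999_456,
}


def to_gib(n_bytes):
    """Convert a byte count to GiB (binary units, not decimal GB)."""
    return n_bytes / GIB


if __name__ == "__main__":
    for name, n_bytes in FILE_SIZES.items():
        print(f"{name}: {to_gib(n_bytes):.2f} GiB")
```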

MTP Assistant Heads

The first QAT MTP probes borrowed the normal non-QAT Gemma 4 assistant heads. Those loaded, but acceptance was weak. The better result came from using the matching QAT assistant sources from Google and converting those assistant checkpoints to Atomic/llama.cpp-compatible GGUF heads.

Official QAT assistant sources:

google/gemma-4-12B-it-qat-q4_0-unquantized-assistant
google/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant
google/gemma-4-31B-it-qat-q4_0-unquantized-assistant

Converted local assistant heads:

Main model	QAT assistant head	Size
Gemma 4 12B QAT	`gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf`	444 MiB
Gemma 4 26B-A4B QAT	`gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf`	441 MiB
Gemma 4 31B QAT	`gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf`	491 MiB

Conversion note: the assistant GGUF needs the `gemma4_assistant` metadata shape that this Atomic llama.cpp build expects, including `n_embd_backbone` and target architecture metadata. A public 31B QAT assistant GGUF I tried used different metadata and did not load as-is. The 12B source repo uses a newer `Gemma4UnifiedAssistantForCausalLM` config name, so I converted it through a temporary config alias to the existing Gemma 4 assistant converter path. The source weights were not hand-edited.

Latest Measured Numbers

Lane	Load to listening	Prefill	Decode	Normalized wall, 1150-in/2000-out	Two-slot aggregate	Notes
Gemma 4 26B-A4B QAT Q4_0, plain F16 KV	~4 s	1194.4 tok/s	59.4 tok/s	34.6 s	90.9 tok/s	best plain row
Gemma 4 26B-A4B QAT Q4_0, QAT MTP + Q8 KV	~18 s	729.3 tok/s	71.4 tok/s	29.6 s	62.5 tok/s	best overall QAT lane
Gemma 4 12B QAT Q4_0, QAT MTP + Q8 KV	~10 s	539.9 tok/s	45.6 tok/s	46.0 s	43.5 tok/s	strong small-model MTP lane
Gemma 4 12B QAT Q4_0, plain F16 KV	~4 s	666.5 tok/s	25.7 tok/s	79.5 s	47.6 tok/s	plain baseline
Gemma 4 31B QAT Q4_0, QAT MTP + F16 KV	~20 s	203.6 tok/s	19.1 tok/s	110.4 s	18.9 tok/s	works, but less efficient than 26B-A4B
Gemma 4 31B QAT Q4_0, plain Q8 KV	~8 s	204.2 tok/s	11.0 tok/s	187.4 s	20.0 tok/s	best plain 31B row

The main result: the 26B-A4B QAT model is the useful lane. Plain Vulkan already gives about 59 tok/s decode with very strong prefill, and the QAT-matched MTP/Q8 path reaches about 71 tok/s single-stream with much better acceptance than the borrowed-head probe.
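The "Normalized wall" column is not defined in the post, but the reported values are consistent with a simple estimate: prompt tokens divided by prefill speed plus output tokens divided by decode speed. The Python sketch below treats that formula as an assumption inferred from the numbers and recomputes each row; `normalized_wall` is a name chosen for this example.

```python
PROMPT_TOKENS = 1150   # "1150-in" from the column header
OUTPUT_TOKENS = 2000   # "2000-out" from the column header


def normalized_wall(prefill_tps, decode_tps,
                    n_in=PROMPT_TOKENS, n_out=OUTPUT_TOKENS):
    """Estimated wall time in seconds: prefill time plus decode time.

    Assumed formula, inferred from the table; ignores load time and overhead.
    """
    return n_in / prefill_tps + n_out / decode_tps


# (lane, prefill tok/s, decode tok/s, reported wall seconds) from the table.
ROWS = [
    ("26B-A4B plain F16 KV", 1194.4, 59.4, 34.6),
    ("26B-A4B QAT MTP + Q8 KV", 729.3, 71.4, 29.6),
    ("12B QAT MTP + Q8 KV", 539.9, 45.6, 46.0),
    ("12B plain F16 KV", 666.5, 25.7, 79.5),
    ("31B QAT MTP + F16 KV", 203.6, 19.1, 110.4),
    ("31B plain Q8 KV", 204.2, 11.0, 187.4),
]

if __name__ == "__main__":
    for lane, prefill, decode, reported in ROWS:
        estimate = normalized_wall(prefill, decode)
        print(f"{lane}: estimated {estimate:.1f} s, reported {reported} s")
```

Under this reading, decode dominates the total for a 2000-token reply, which is why the MTP lane wins on wall time despite its slower prefill.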

Draft Acceptance

Current QAT-matched MTP rows:

Model	MTP acceptance	Effective acceptance-adjusted decode
Gemma 4 12B QAT Q4_0 + QAT MTP head	78.4%	43.9 tok/s
Gemma 4 26B-A4B QAT Q4_0 + QAT MTP head	91.8%	71.4 tok/s
Gemma 4 31B QAT Q4_0 + QAT MTP head	60.4%	19.0 tok/s

The 26B-A4B row is the standout. It keeps the fast decode lane and acceptance is now high enough that I would treat it as the real QAT MTP result, not just a speed probe.

31B is more of a tradeoff:

31B setting	Decode	MTP acceptance
`DRAFT_BLOCK_SIZE=3`	19.1 tok/s	~60%
`DRAFT_BLOCK_SIZE=2`	16.5-17.1 tok/s	~76%

`DRAFT_BLOCK_SIZE=1` is not accepted by this build; the allowed range starts at 2. `DRAFT_P_MIN` did not materially change the 31B acceptance in my short sweep.
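One way to reason about the block-size tradeoff is the standard speculative-decoding estimate: if each drafted token is accepted independently with probability a and k tokens are drafted per step, the expected number of tokens emitted per target-model pass is (1 - a^(k+1)) / (1 - a). The Python sketch below is a rough model only. It assumes independent acceptance, and how `DRAFT_BLOCK_SIZE` maps to k in this fork is not stated here, so the k values are hypothetical.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens emitted per verification pass.

    accept_rate: per-token acceptance probability in [0, 1] (assumed i.i.d.).
    draft_len:   number of drafted tokens per step (hypothetical k).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        # Every draft token is accepted, plus one token from the target model.
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # Acceptance rates close to the two 31B settings, over a few draft lengths.
    for rate in (0.60, 0.76):
        for k in (1, 2, 3):
            tokens = expected_tokens_per_step(rate, k)
            print(f"accept={rate:.2f} k={k}: {tokens:.2f} tokens/step")
```

The estimate ignores the cost of running the draft head itself, so it describes the shape of the tradeoff and does not predict tok/s.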

For comparison, the earlier borrowed-head QAT MTP rows showed much lower draft acceptance:

Model	Borrowed-head acceptance	QAT-matched acceptance
Gemma 4 26B-A4B QAT Q4_0 + MTP	56.9%	91.8%
Gemma 4 31B QAT Q4_0 + MTP	42.5%	60.4%

Context Against Previous Local Gemma Rows

Model / lane	Quant / path	Prefill	Decode
Gemma 4 26B-A4B non-QAT	UD-Q6_K_XL, plain Vulkan	1002.8 tok/s	44.8 tok/s
Gemma 4 26B-A4B QAT	Q4_0, plain Vulkan	1194.4 tok/s	59.4 tok/s
Gemma 4 26B-A4B QAT	Q4_0 + QAT MTP/Q8 KV	729.3 tok/s	71.4 tok/s
Gemma 4 31B non-QAT	Q6 plain Vulkan	151.3 tok/s	~8.1 tok/s
Gemma 4 31B QAT	Q4_0 plain Vulkan	204.2 tok/s	11.0 tok/s
Gemma 4 31B QAT	Q4_0 + QAT MTP/F16 KV	203.6 tok/s	19.1 tok/s
Gemma 4 12B QAT	Q4_0 plain Vulkan	666.5 tok/s	25.7 tok/s
Gemma 4 12B QAT	Q4_0 + QAT MTP/Q8 KV	539.9 tok/s	45.6 tok/s
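The decode columns above can be turned into rough speedup ratios. The Python sketch below uses only numbers from the table. Note that the plain and MTP rows do not always share the same KV cache setting, so the ratios mix the MTP effect with the KV choice and should be read as approximate.

```python
# Decode speeds in tok/s, copied from the table above.
DECODE_TPS = {
    "12B QAT": {"plain": 25.7, "mtp": 45.6},
    "26B-A4B QAT": {"plain": 59.4, "mtp": 71.4},
    "31B QAT": {"plain": 11.0, "mtp": 19.1},
}


def speedup(baseline_tps, candidate_tps):
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_tps / baseline_tps


if __name__ == "__main__":
    for model, lanes in DECODE_TPS.items():
        ratio = speedup(lanes["plain"], lanes["mtp"])
        print(f"{model}: MTP decode is {ratio:.2f}x plain")

    # QAT Q4_0 plain versus the earlier non-QAT rows (plain Vulkan in both cases).
    print(f"26B-A4B Q4_0 vs UD-Q6_K_XL: {speedup(44.8, 59.4):.2f}x")
    print(f"31B Q4_0 vs Q6: {speedup(8.1, 11.0):.2f}x")
```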

Takeaway

On a 128 GB Strix Halo APU, Google's official Gemma 4 26B-A4B QAT Q4_0 GGUF is a very strong local lane: about 59 tok/s plain and about 71 tok/s with the QAT-matched MTP/Q8 setup.

The important update is that QAT-matched assistant heads matter. Borrowing the normal non-QAT assistant heads was useful for proving that MTP could load, but the matched QAT heads substantially improved acceptance, especially on the 26B-A4B row.

I have not verified these QAT MTP rows on stock upstream llama.cpp or vLLM locally. The measured claim here is the Atomic llama.cpp TurboQuant fork on Vulkan/RADV.

submitted by /u/westsunset