r/LocalLLaMA · June 6, 2026 · 3 min read

QAT MTP Heads Upload + PARALLEL=2 Fix + 12B 2-slot Bench

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Title: Gemma 4 QAT MTP assistant heads now public on HuggingFace + PARALLEL=2 crash fix + 12B 2-slot bench (Strix Halo / Vulkan)

Three things in one update: the converted QAT-matched draft heads are now uploaded for anyone to use; we found and fixed the PARALLEL=2 crash in the Atomic fork and reported the same bug on the native llama.cpp PR; and here are the first 12B 2-slot numbers after the fix.

1. QAT-matched MTP heads are now on HuggingFace

boxwrench/gemma-4-qat-mtp-assistant-heads

Three draft heads for speculative decoding (MTP) with the official Gemma 4 QAT Q4_0 models:

| File | Pairs with | Size |
|---|---|---|
| `gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf` | google/gemma-4-12B-it-qat-q4_0 | 444 MiB |
| `gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf` | google/gemma-4-26B-A4B-it-qat-q4_0 | 441 MiB |
| `gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf` | google/gemma-4-31B-it-qat-q4_0 | 491 MiB |

Converted from Google's official unquantized QAT assistant checkpoints (google/gemma-4-{12B,26B-A4B,31B}-it-qat-q4_0-unquantized-assistant) to gemma4_assistant GGUF Q8_0.

Why QAT-matched heads matter: A draft head guesses tokens ahead of the main model. If the head was trained against full-precision weights but the main model is a QAT quantization, their distributions diverge — the head guesses what the full-precision model would have said, and the QAT model disagrees more often. Using heads that were trained against the same QAT checkpoint closes that gap substantially:

| Model | Acceptance, non-QAT head | Acceptance, QAT-matched head | Change |
|---|---|---|---|
| 12B QAT Q4_0 | 71.3% | 78.4% | +7 pp |
| 26B-A4B QAT Q4_0 | 56.9% | 91.8% | +35 pp |
| 31B QAT Q4_0 | 42.5% | 60.4% | +18 pp |

The 26B-A4B gap was especially stark — nearly 35 percentage points of acceptance rate were being lost purely to the head mismatch.
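For a rough feel of what an acceptance-rate gain is worth, here is a minimal sketch. It assumes each drafted token is accepted independently with probability `alpha` and that every draft step proposes `n_draft` tokens; under that simplified model the expected number of tokens emitted per verification pass is the geometric sum 1 + alpha + ... + alpha^n_draft. The helper name and the independence assumption are illustrative and not taken from the benchmark scripts, and the percentages in the table may be measured differently.

```cpp
#include <cassert>

// Hypothetical helper: expected tokens produced per verification pass,
// ASSUMING each drafted token is accepted independently with probability alpha.
// Every pass yields at least one token (the k = 0 term of the sum).
double expected_tokens_per_pass(double alpha, int n_draft) {
    assert(alpha >= 0.0 && alpha <= 1.0);
    assert(n_draft >= 0);
    double total = 0.0;
    double term = 1.0;  // alpha^0
    for (int k = 0; k <= n_draft; ++k) {
        total += term;  // add alpha^k
        term *= alpha;  // advance to alpha^(k+1)
    }
    return total;
}
```

Under these assumptions and a draft length of 3, going from `alpha = 0.713` to `alpha = 0.784` moves the estimate from about 2.58 to about 2.88 tokens per pass. Real speedups also depend on draft cost and batching, which this toy model ignores.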

Compatibility: These use the gemma4_assistant architecture. They load on the Atomic TurboQuant fork now, and on stock llama.cpp once PR #23398 merges (it uses the same architecture shape). They are not compatible with the ik_llama variant heads (gemma4_mtp format).

2. PARALLEL=2 crash — root cause found and fixed

Running --n-parallel 2 with any of these heads was crashing with an assertion failure:

GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2)

in ggml_reshape_3d, called from llm_build_gemma4_mtp.

The root cause was a single line in gemma4-assistant.cpp:

```cpp
// before (crashes at n_parallel=2):
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);

// fix:
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, 1);
```

The MTP draft step always processes exactly one token column regardless of how many server slots are active. Using n_tokens from the main forward pass worked fine with one slot (where n_tokens=1 anyway) but crashed as soon as a second slot fired its first draft step (n_tokens=2, element count mismatch).
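The element-count mismatch can be reproduced without ggml. The sketch below is a stand-in for the check behind the assertion, not the actual graph code; the head dimensions are made-up values chosen only for illustration, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the condition ggml_reshape_3d asserts on:
// the source tensor must hold exactly ne0 * ne1 * ne2 elements.
bool reshape_3d_is_valid(int64_t n_elements, int64_t ne0, int64_t ne1, int64_t ne2) {
    return n_elements == ne0 * ne1 * ne2;
}

// Illustrative check with ASSUMED head dimensions (not Gemma 4's real values).
bool draft_reshape_ok(int64_t n_tokens_main, bool use_fix) {
    assert(n_tokens_main >= 1);
    const int64_t n_embd_head = 256;  // assumed value
    const int64_t n_head      = 8;    // assumed value
    // The MTP draft step produces Q for exactly one token column.
    const int64_t q_elements = n_embd_head * n_head * 1;
    // Before the fix the last dimension came from the main forward pass.
    const int64_t ne2 = use_fix ? 1 : n_tokens_main;
    return reshape_3d_is_valid(q_elements, n_embd_head, n_head, ne2);
}
```

With one slot, `draft_reshape_ok(1, false)` passes because `n_tokens` happens to equal 1. With two slots, `draft_reshape_ok(2, false)` fails the count check, while `draft_reshape_ok(2, true)` passes.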

The fix also requires LLAMA_PIPELINE_DEPTH2=0 as an env var on Vulkan to prevent thread queue deadlocks when two slots are active simultaneously.

Fix submitted to the Atomic fork: AtomicBot-ai/atomic-llama-cpp-turboquant#26.

The same bug is in the native llama.cpp PR #23398 — identical line, identical fix. Left a comment on that PR so it can be patched before merge.

Also worth calling out: u/janvitos independently did the same work for stock llama.cpp — PR #23398 adds Gemma 4 MTP support to the native build. Great minds think alike. Once that merges, the heads here will load on stock llama.cpp without needing the Atomic fork.

3. First 12B PARALLEL=2 bench (post-fix, Strix Halo / Vulkan)

Hardware: AMD Ryzen AI Max+ 395, 128 GB LPDDR5X unified, Vulkan/RADV (Mesa 25.2.8)

| Metric | 12B plain 2-slot | 12B MTP PARALLEL=1 | 12B MTP PARALLEL=2 |
|---|---|---|---|
| 2-slot aggregate | 47.6 tok/s | 43.5 tok/s | 62.5 tok/s |
| Single-stream decode | — | 45.6 tok/s | 38.6 tok/s (48.6 eff.) |
| MTP acceptance | N/A | 78.4% | 88.6% |
| Wall time (1150-in/2000-out) | 79.5 s | 46.0 s | 53.9 s |

MTP PARALLEL=2 raises aggregate throughput by 31% over plain 2-slot (62.5 vs. 47.6 tok/s). Per-slot decode drops because the two slots share the same bandwidth, but total output throughput improves, and acceptance actually went up relative to single-slot (88.6% vs. 78.4%), which suggests the model is doing more useful speculation per pass when it has two requests to interleave.
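The +31% figure follows directly from the aggregate row of the table. A minimal sketch of the arithmetic, with a hypothetical helper name:

```cpp
#include <cassert>

// Relative change of `value` versus `baseline`, in percent.
double percent_change(double baseline, double value) {
    assert(baseline != 0.0);
    return (value - baseline) / baseline * 100.0;
}
```

`percent_change(47.6, 62.5)` comes out at roughly 31.3, matching the aggregate gain. Applied to the single-stream row, `percent_change(45.6, 38.6)` is roughly -15.4, which is the per-slot cost of running two slots.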

The 26B-A4B PARALLEL=2 bench is still running. It is expected to approach or surpass the plain 2-slot figure of 90.9 tok/s, since the 26B-A4B has higher acceptance to start with.

Full numbers and context

Previous QAT numbers post (plain + single-slot MTP): Gemma 4 QAT Q4_0 bench on Strix Halo

Full benchmark data, reproducibility matrix, serve scripts: boxwrench/tesla_agent

HF model card has usage examples including the LLAMA_PIPELINE_DEPTH2=0 flag and the --mtp-draft-n 3 --draft-p-min 0.75 settings that gave these numbers.

submitted by /u/westsunset