r/LocalLLaMA · · 3 min read

QAT MTP Heads Upload + PARALLEL=2 Fix + 12B 2-slot Bench

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.


Title: Gemma 4 QAT MTP assistant heads now public on HuggingFace + PARALLEL=2 crash fix + 12B 2-slot bench (Strix Halo / Vulkan)


Three things in one update: the converted QAT-matched draft heads are now uploaded for anyone to use, we found and fixed the PARALLEL=2 crash in the Atomic fork and reported the same bug on the native llama.cpp PR, and here are the first 12B 2-slot numbers after the fix.


1. QAT-matched MTP heads are now on HuggingFace

boxwrench/gemma-4-qat-mtp-assistant-heads

Three draft heads for speculative decoding (MTP) with the official Gemma 4 QAT Q4_0 models:

| File | Pairs with | Size |
|---|---|---|
| gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf | google/gemma-4-12B-it-qat-q4_0 | 444 MiB |
| gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf | google/gemma-4-26B-A4B-it-qat-q4_0 | 441 MiB |
| gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf | google/gemma-4-31B-it-qat-q4_0 | 491 MiB |

Converted from Google's official unquantized QAT assistant checkpoints (google/gemma-4-{12B,26B-A4B,31B}-it-qat-q4_0-unquantized-assistant) to gemma4_assistant GGUF Q8_0.

Why QAT-matched heads matter: A draft head guesses tokens ahead of the main model. If the head was trained against full-precision weights but the main model is a QAT quantization, their distributions diverge — the head guesses what the full-precision model would have said, and the QAT model disagrees more often. Using heads that were trained against the same QAT checkpoint closes that gap substantially:

| Model | Non-QAT head | QAT-matched head | Change |
|---|---|---|---|
| 12B QAT Q4_0 | 71.3% | 78.4% | +7 pp |
| 26B-A4B QAT Q4_0 | 56.9% | 91.8% | +35 pp |
| 31B QAT Q4_0 | 42.5% | 60.4% | +18 pp |

The 26B-A4B gap was especially stark — nearly 35 percentage points of acceptance rate were being lost purely to the head mismatch.
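To get a rough feel for what these acceptance rates mean for throughput, here is a minimal sketch under a simplified model. It assumes each drafted token is accepted independently with probability `alpha` and that everything after the first rejection is discarded; real acceptance is not independent per token, and `--draft-p-min` can cut drafts short, so treat the result as an estimate only. The function name `expected_tokens_per_pass` is hypothetical and not part of llama.cpp.

```cpp
#include <cassert>

// Expected tokens emitted per main-model pass under a simplified model:
// each of the n_draft drafted tokens is accepted independently with
// probability alpha, and nothing after the first rejection is kept.
// This is the geometric series 1 + alpha + alpha^2 + ... + alpha^n_draft.
double expected_tokens_per_pass(double alpha, int n_draft) {
    assert(alpha >= 0.0 && alpha <= 1.0);
    assert(n_draft >= 0);
    double total = 0.0;
    double term = 1.0;  // alpha^0: the token the main model always produces
    for (int k = 0; k <= n_draft; ++k) {
        total += term;
        term *= alpha;
    }
    return total;
}
```

With a draft length of 3 (the `--mtp-draft-n 3` setting used for these runs), `expected_tokens_per_pass(0.569, 3)` versus `expected_tokens_per_pass(0.918, 3)` gives a sense of how much the 26B-A4B head mismatch could cost per pass under this model.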

Compatibility: These use the gemma4_assistant architecture. They load on the Atomic TurboQuant fork now, and on stock llama.cpp once PR #23398 merges (it uses the same architecture shape). They are not compatible with the ik_llama variant heads (gemma4_mtp format).


2. PARALLEL=2 crash — root cause found and fixed

Running --n-parallel 2 with any of these heads was crashing with an assertion failure:

GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2)

in ggml_reshape_3d, called from llm_build_gemma4_mtp.

The root cause was a single line in gemma4-assistant.cpp:

```cpp
// before (crashes at n_parallel=2):
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);

// fix:
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, 1);
```

The MTP draft step always processes exactly one token column regardless of how many server slots are active. Using n_tokens from the main forward pass worked fine with one slot (where n_tokens=1 anyway) but crashed as soon as a second slot fired its first draft step (n_tokens=2, element count mismatch).
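The element-count mismatch can be illustrated without ggml at all. The sketch below only mirrors the condition in the assertion; the helper names and the dimensions used in the usage note are hypothetical, not taken from the Gemma 4 configs or from the fork's source.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the condition behind GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2):
// a reshape is only valid when the target shape holds exactly as many
// elements as the source tensor.
bool reshape_3d_is_valid(int64_t nelements, int64_t ne0, int64_t ne1, int64_t ne2) {
    return nelements == ne0 * ne1 * ne2;
}

// Element count of the draft-step Q tensor, which (per the description
// above) always covers exactly one token column.
int64_t draft_q_nelements(int64_t n_embd_head, int64_t n_head) {
    assert(n_embd_head > 0 && n_head > 0);
    return n_embd_head * n_head * 1;
}
```

Taking illustrative values of `n_embd_head = 256` and `n_head = 8`, the draft Q tensor has 2048 elements. `reshape_3d_is_valid(2048, 256, 8, 1)` holds, which is the one-slot case and the fixed code, while `reshape_3d_is_valid(2048, 256, 8, 2)` does not, which is the case that tripped the assertion once `n_tokens` became 2.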

The fix also requires LLAMA_PIPELINE_DEPTH2=0 as an env var on Vulkan to prevent thread queue deadlocks when two slots are active simultaneously.

Fix submitted to the Atomic fork: AtomicBot-ai/atomic-llama-cpp-turboquant#26.

The same bug is in the native llama.cpp PR #23398 — identical line, identical fix. Left a comment on that PR so it can be patched before merge.

Also worth calling out: u/janvitos independently did the same work for stock llama.cpp — PR #23398 adds Gemma 4 MTP support to the native build. Great minds think alike. Once that merges, the heads here will load on stock llama.cpp without needing the Atomic fork.


3. First 12B PARALLEL=2 bench (post-fix, Strix Halo / Vulkan)

Hardware: AMD Ryzen AI Max+ 395, 128 GB LPDDR5X unified, Vulkan/RADV (Mesa 25.2.8)

| Metric | 12B plain 2-slot | 12B MTP PARALLEL=1 | 12B MTP PARALLEL=2 |
|---|---|---|---|
| 2-slot aggregate | 47.6 tok/s | 43.5 tok/s | 62.5 tok/s |
| Single-stream decode | 45.6 tok/s | 38.6 tok/s | (48.6 eff.) |
| MTP acceptance | N/A | 78.4% | 88.6% |
| Wall time (1150-in/2000-out) | 79.5 s | 46.0 s | 53.9 s |

MTP at PARALLEL=2 reaches 62.5 tok/s aggregate, +31% over plain 2-slot (47.6 tok/s). Per-slot decode drops because two slots share the same bandwidth, but total output throughput improves. Acceptance also went up relative to single-slot (88.6% vs 78.4%); one likely explanation is that the model does more useful speculation per pass when it has two requests to interleave.
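For anyone who wants to rerun the comparison on their own numbers, the relative gain is simple arithmetic. This is a minimal sketch; `percent_gain` is a hypothetical helper name, not something from the benchmark scripts.

```cpp
#include <cassert>

// Relative change of `value` over `baseline`, in percent.
// Positive means `value` is faster than `baseline`.
double percent_gain(double value, double baseline) {
    assert(baseline > 0.0);
    return (value / baseline - 1.0) * 100.0;
}
```

Calling `percent_gain(62.5, 47.6)` on the aggregate figures gives roughly 31, which is where the +31% comes from.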

The 26B-A4B PARALLEL=2 bench is still running. We expect it to come close to or surpass the plain 2-slot figure of 90.9 tok/s, since the 26B-A4B has higher acceptance to start with.


Full numbers and context

Previous QAT numbers post (plain + single-slot MTP): Gemma 4 QAT Q4_0 bench on Strix Halo

Full benchmark data, reproducibility matrix, serve scripts: boxwrench/tesla_agent

The HF model card has usage examples, including the LLAMA_PIPELINE_DEPTH2=0 environment variable and the --mtp-draft-n 3 --draft-p-min 0.75 settings that produced these numbers.

submitted by /u/westsunset