QAT MTP Heads Upload + PARALLEL=2 Fix + 12B 2-slot Bench
Title: Gemma 4 QAT MTP assistant heads now public on HuggingFace + PARALLEL=2 crash fix + 12B 2-slot bench (Strix Halo / Vulkan)
Three things in one update: the converted QAT-matched draft heads are now uploaded for anyone to use, we found and fixed the PARALLEL=2 crash in the Atomic fork and reported the same bug on the native llama.cpp PR, and here are the first 12B 2-slot numbers after the fix.
1. QAT-matched MTP heads are now on HuggingFace
boxwrench/gemma-4-qat-mtp-assistant-heads
Three draft heads for speculative decoding (MTP) with the official Gemma 4 QAT Q4_0 models:
| File | Pairs with | Size |
|---|---|---|
| gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf | google/gemma-4-12B-it-qat-q4_0 | 444 MiB |
| gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf | google/gemma-4-26B-A4B-it-qat-q4_0 | 441 MiB |
| gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf | google/gemma-4-31B-it-qat-q4_0 | 491 MiB |
Converted from Google's official unquantized QAT assistant checkpoints (google/gemma-4-{12B,26B-A4B,31B}-it-qat-q4_0-unquantized-assistant) to gemma4_assistant GGUF Q8_0.
Why QAT-matched heads matter: A draft head guesses tokens ahead of the main model. If the head was trained against full-precision weights but the main model is a QAT quantization, their distributions diverge — the head guesses what the full-precision model would have said, and the QAT model disagrees more often. Using heads that were trained against the same QAT checkpoint closes that gap substantially:
| Model | Non-QAT head | QAT-matched head | Change |
|---|---|---|---|
| 12B QAT Q4_0 | 71.3% | 78.4% | +7 pp |
| 26B-A4B QAT Q4_0 | 56.9% | 91.8% | +35 pp |
| 31B QAT Q4_0 | 42.5% | 60.4% | +18 pp |
The 26B-A4B gap was especially stark — nearly 35 percentage points of acceptance rate were being lost purely to the head mismatch.
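To get a rough feel for what these acceptance rates mean per verification pass, here is a minimal sketch under a simplifying assumption that is not part of the benchmark: each drafted token is accepted independently with probability p equal to the reported acceptance rate, and the first rejection discards the rest of the draft. The function name `expected_tokens_per_pass` is made up for illustration, and real accounting in llama.cpp (including the --draft-p-min cutoff) will differ.

```cpp
#include <cassert>

// Toy model (an assumption, not llama.cpp's accounting): each drafted token
// is accepted independently with probability p, and the first rejection
// discards the rest of the draft. The main model always contributes one
// token per verification pass, so the expected yield is
//   1 + p + p^2 + ... + p^n_draft.
double expected_tokens_per_pass(double p, int n_draft) {
    assert(p >= 0.0 && p <= 1.0);
    assert(n_draft >= 0);
    double expected = 1.0;  // token produced by the main model itself
    double prefix = 1.0;    // probability that the first k drafts all pass
    for (int k = 1; k <= n_draft; ++k) {
        prefix *= p;
        expected += prefix;
    }
    return expected;
}
```

Under this toy model with a draft length of 3, moving the 26B-A4B from 56.9% to 91.8% acceptance raises the expected yield from roughly 2.1 to roughly 3.5 tokens per pass, which is why the head mismatch costs so much.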
Compatibility: These use the gemma4_assistant architecture. They load on the Atomic TurboQuant fork now, and on stock llama.cpp once PR #23398 merges (it uses the same architecture shape). They are not compatible with the ik_llama variant heads (gemma4_mtp format).
2. PARALLEL=2 crash — root cause found and fixed
Running --n-parallel 2 with any of these heads was crashing with an assertion failure:
GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2)
in ggml_reshape_3d, called from llm_build_gemma4_mtp.
The root cause was a single line in gemma4-assistant.cpp:
```cpp
// before (crashes at n_parallel=2):
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens);

// fix:
Qcur = ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, 1);
```
The MTP draft step always processes exactly one token column regardless of how many server slots are active. Using n_tokens from the main forward pass worked fine with one slot (where n_tokens=1 anyway) but crashed as soon as a second slot fired its first draft step (n_tokens=2, element count mismatch).
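The element-count mismatch can be reproduced without ggml at all. The sketch below is a stand-in for the assertion, not the actual ggml code: `reshape_3d_is_valid` and `draft_q_nelements` are hypothetical helper names, and any head dimensions passed in are placeholders rather than the real Gemma 4 values.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the check behind GGML_ASSERT(ggml_nelements(a) == ne0*ne1*ne2):
// a reshape is only valid when the total element count is unchanged.
bool reshape_3d_is_valid(int64_t nelements, int64_t ne0, int64_t ne1, int64_t ne2) {
    return nelements == ne0 * ne1 * ne2;
}

// Element count of the draft-step Q tensor, which per the post always holds
// exactly one token column no matter how many slots are active.
int64_t draft_q_nelements(int64_t n_embd_head, int64_t n_head) {
    return n_embd_head * n_head * 1;
}
```

With placeholder dimensions of 256 and 8, the tensor holds 2048 elements. Reshaping with a third dimension of 1 is valid, while passing n_tokens=2 asks for 4096 elements and fails the check, which mirrors the crash seen when the second slot fires.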
On Vulkan, the fix also requires setting the environment variable LLAMA_PIPELINE_DEPTH2=0 to prevent thread queue deadlocks when two slots are active simultaneously.
Fix submitted to the Atomic fork: AtomicBot-ai/atomic-llama-cpp-turboquant#26.
The same bug is in the native llama.cpp PR #23398 — identical line, identical fix. Left a comment on that PR so it can be patched before merge.
Also worth calling out: u/janvitos independently did the same work for stock llama.cpp — PR #23398 adds Gemma 4 MTP support to the native build. Great minds think alike. Once that merges, the heads here will load on stock llama.cpp without needing the Atomic fork.
3. First 12B PARALLEL=2 bench (post-fix, Strix Halo / Vulkan)
Hardware: AMD Ryzen AI Max+ 395, 128 GB LPDDR5X unified, Vulkan/RADV (Mesa 25.2.8)
| Metric | 12B plain 2-slot | 12B MTP PARALLEL=1 | 12B MTP PARALLEL=2 |
|---|---|---|---|
| 2-slot aggregate | 47.6 tok/s | 43.5 tok/s | 62.5 tok/s |
| Single-stream decode | — | 45.6 tok/s | 38.6 tok/s (48.6 eff.) |
| MTP acceptance | N/A | 78.4% | 88.6% |
| Wall time (1150-in/2000-out) | 79.5 s | 46.0 s | 53.9 s |
MTP PARALLEL=2 aggregate throughput is +31% over plain 2-slot (62.5 vs 47.6 tok/s). Per-slot decode drops, since two slots share the same bandwidth, but total output throughput improves. Acceptance also went up relative to single-slot (88.6% vs 78.4%), which suggests the model is doing more useful speculation per pass when it has two requests to interleave.
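For anyone rechecking the percentages against the table, the arithmetic is a plain relative change. This is a small sketch with a made-up helper name (`percent_change`), not something taken from the benchmark scripts.

```cpp
#include <cassert>

// Relative change of `value` over `baseline`, in percent.
double percent_change(double baseline, double value) {
    assert(baseline > 0.0);
    return (value - baseline) / baseline * 100.0;
}
```

Using the 2-slot aggregate row, percent_change(47.6, 62.5) comes out at about +31.3, and percent_change(47.6, 43.5) at about -8.6, so MTP only beats plain 2-slot on aggregate throughput once both slots can draft.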
The 26B-A4B PARALLEL=2 bench is still running. It is expected to approach or surpass the plain 2-slot figure of 90.9 tok/s, since the 26B-A4B has higher acceptance to start with.
Full numbers and context
Previous QAT numbers post (plain + single-slot MTP): Gemma 4 QAT Q4_0 bench on Strix Halo
Full benchmark data, reproducibility matrix, serve scripts: boxwrench/tesla_agent
The HF model card has usage examples, including the LLAMA_PIPELINE_DEPTH2=0 environment variable and the --mtp-draft-n 3 --draft-p-min 0.75 settings that gave these numbers.