r/LocalLLaMA · · 2 min read

Gemma 4 31B QAT GGUF loads with MTP branch, but outputs repeated <unused49> - any working recipe?

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Update: you were right to suggest checking the hash.

My cached GGUF blob was corrupt. HF expected SHA256:

9188a71055550f1e60b875d02b7abb63625ac11b4a6f148d6b22b3b28ba3d335

My old local blob hashed to:

20e9ffda0c1a0fb5b6ed9cc445834e5c3e98a1f9ffe4a64edf319cbd0aa85fba

I moved the blob aside, force-redownloaded with hf download --force-download, and rebuilt latest llama.cpp master after the Gemma 4 MTP merge.

Result: main 31B QAT GGUF now works. No more repeated <unused49>.

Tested with: - llama.cpp master f0156d140 - gemma-4-31B-it-qat-UD-Q4_K_XL.gguf - RTX 5090 32GB - --ctx-size 40960 - --cache-type-k q8_0 - --cache-type-v q8_0 - --flash-attn on

VRAM is about 21.5 GB and a direct chat test returns clean text.

MTP assistant still does not work with my local assistant GGUF because of metadata/assertion issues, but the main long-context QAT model is alive.

Thank you for the hash tip. That was the key.


I’m trying to run:

unsloth/gemma-4-31B-it-qat-GGUF

gemma-4-31B-it-qat-UD-Q4_K_XL.gguf

on an RTX 5090 32GB using llama.cpp Gemma 4 MTP PR branch.

Main model loads. Without the MTP assistant head, /v1/chat/completions returns repeated <unused49>.

I also tried the public MTP assistant head:

boxwrench/gemma-4-qat-mtp-assistant-heads

gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf

That file needed some local compatibility fixes because the loader expected:

- gemma4-assistant but the GGUF uses gemma4_assistant

- embedding_length_out but the GGUF has n_embd_backbone = 5376

- nextn_predict_layers but the GGUF has block_count = 4

- nextn.pre_projection / nextn.post_projection but the GGUF tensors are mtp.pre_projection / mtp.post_projection

After patching those locally, the model and draft head load and draft-mtp initializes, but generation still returns repeated <unused49>. Timings show generation is active, but draft_n_accepted = 0.

Example:

content: "<unused49><unused49><unused49>..."

draft_n: 242

draft_n_accepted: 0

Command shape:

llama-server \

--model gemma-4-31B-it-qat-UD-Q4_K_XL.gguf \

--model-draft gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf \

--spec-type draft-mtp \

--spec-draft-n-max 4 \

--ctx-size 4096 \

-np 1 \

--jinja \

--reasoning off

Also tried reasoning on, built-in gemma template override, and no draft model. Same <unused49> output.

Has anyone successfully run the 31B QAT GGUF specifically, not only 12B QAT? If yes, which exact llama.cpp commit/fork/assistant-head file/command are you using?

submitted by /u/WaveformEntropy
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA