Gemma 4 31B QAT GGUF loads with MTP branch, but outputs repeated <unused49> - any working recipe?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Update: you were right to suggest checking the hash.
My cached GGUF blob was corrupt. HF expected SHA256:
9188a71055550f1e60b875d02b7abb63625ac11b4a6f148d6b22b3b28ba3d335
My old local blob hashed to:
20e9ffda0c1a0fb5b6ed9cc445834e5c3e98a1f9ffe4a64edf319cbd0aa85fba
I moved the blob aside, force-redownloaded with hf download --force-download, and rebuilt latest llama.cpp master after the Gemma 4 MTP merge.
Result: main 31B QAT GGUF now works. No more repeated <unused49>.
Tested with: - llama.cpp master f0156d140 - gemma-4-31B-it-qat-UD-Q4_K_XL.gguf - RTX 5090 32GB - --ctx-size 40960 - --cache-type-k q8_0 - --cache-type-v q8_0 - --flash-attn on
VRAM is about 21.5 GB and a direct chat test returns clean text.
MTP assistant still does not work with my local assistant GGUF because of metadata/assertion issues, but the main long-context QAT model is alive.
Thank you for the hash tip. That was the key.
I’m trying to run:
unsloth/gemma-4-31B-it-qat-GGUF
gemma-4-31B-it-qat-UD-Q4_K_XL.gguf
on an RTX 5090 32GB using llama.cpp Gemma 4 MTP PR branch.
Main model loads. Without the MTP assistant head, /v1/chat/completions returns repeated <unused49>.
I also tried the public MTP assistant head:
boxwrench/gemma-4-qat-mtp-assistant-heads
gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf
That file needed some local compatibility fixes because the loader expected:
- gemma4-assistant but the GGUF uses gemma4_assistant
- embedding_length_out but the GGUF has n_embd_backbone = 5376
- nextn_predict_layers but the GGUF has block_count = 4
- nextn.pre_projection / nextn.post_projection but the GGUF tensors are mtp.pre_projection / mtp.post_projection
After patching those locally, the model and draft head load and draft-mtp initializes, but generation still returns repeated <unused49>. Timings show generation is active, but draft_n_accepted = 0.
Example:
content: "<unused49><unused49><unused49>..."
draft_n: 242
draft_n_accepted: 0
Command shape:
llama-server \
--model gemma-4-31B-it-qat-UD-Q4_K_XL.gguf \
--model-draft gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf \
--spec-type draft-mtp \
--spec-draft-n-max 4 \
--ctx-size 4096 \
-np 1 \
--jinja \
--reasoning off
Also tried reasoning on, built-in gemma template override, and no draft model. Same <unused49> output.
Has anyone successfully run the 31B QAT GGUF specifically, not only 12B QAT? If yes, which exact llama.cpp commit/fork/assistant-head file/command are you using?
[link] [comments]
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.