Gemma 4 31B QAT Q4 vs standard Q4 — Top1 KLD benchmark results have me confused. Someone please explain or poke holes in this.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Edited -
After digging into this some more and reviewing Unsloth's post for better understanding, the divergence APPEARS to stem from my not using the BF16 QAT model as the "reference" model.
The QAT vs standard Q4 comparison in our benchmark is not apples-to-apples. The QAT models were evaluated against a reference they were never optimised toward. The standard Q4_0 and Q4_K_M comparison is valid. The "QAT is worse" conclusion needs a big asterisk: we can't actually tell how good the QAT models are because we didn't have the right reference.
I'm re-running the QAT models with the QAT BF16 model as the reference.
-- original below:
I'll be upfront: I vibe-benched and vibe-reported this with Claude Sonnet 4.6, but I reviewed and edited everything before posting (too lazy to take out all the AI EM dash —), so hopefully nobody considers this AI slop. And more importantly, I genuinely don't understand why I'm getting these counter-intuitive results, so I'm hoping the community can either explain it or tell me what I did wrong.
Background
One of the local LLMs I run entirely on CPU is the Gemma 4 31B model at Q8, as I can't afford the quality loss that comes with dropping to Q3 to fit on my 16GB GPU. My setup is dual Xeon Platinum 8358 (128 threads), 256 GB DDR4. Gemma 4 31B Q8_0 sits at around 4 t/s generation... slow, but it earns its keep on quality-sensitive workloads where I need the model to reason carefully over long, dense text: background/overnight jobs where I don't need the speed but do need the smarts and accuracy.
The new QAT Q4 models are appealing: 17 GB vs 32 GB, roughly double the generation speed on bandwidth-limited hardware. Google released the checkpoints without publishing any quantitative accuracy comparisons. Unsloth published their own numbers (96.67% top-1 vs BF16) which looked promising. I wanted something expressed as KLD — the same metric LocalBench uses — so I ran my own benchmark.
What I did not expect: standard Q4_0 beats QAT Q4_0. By a lot. And Q4_K_M beats everything. I have no good explanation for this and I'm hoping someone does.
Why first 5,000 tokens and not the full wikitext-2 test set?
The full set is ~245,000 tokens. On CPU at ~4 t/s for Q8_0, a full stride-1 evaluation runs roughly 13 hours for all models. Instead: first 5,000 tokens, stride 5, ~820 sample positions per model. Reproducible — same file, same parameters, same result.
Are the results deterministic? Yes — each model ran 3 times. Std dev was ±0.00% across all runs. Temperature=0 + CPU inference is perfectly deterministic. So 3 runs confirmed this isn't noise.
Inference engine
Mainline llama.cpp (llama-xeon8358 image). Run flags: numactl --interleave=all, --numa distribute, --threads 64, --no-mmap --mlock. KV cache forced to f16 for all models — isolates weight quantization quality only, no KV noise mixed in. (Production uses the IK_LLama fork for its Xeon-optimised kernels, but it has an FA assertion bug at large sliding-window contexts so mainline was used here — same GGUF files, same math.)
Models tested
| Model | Repo | File |
|---|---|---|
| Reference | bartowski/google_gemma-4-31B-it-GGUF | google_gemma-4-31B-it-Q8_0.gguf |
| Google QAT Q4_0 | google/gemma-4-31B-it-qat-q4_0-gguf | gemma-4-31B_q4_0-it.gguf |
| Unsloth QAT UD-Q4_K_XL | unsloth/gemma-4-31B-it-qat-GGUF | gemma-4-31B-it-qat-UD-Q4_K_XL.gguf |
| Unsloth Q4_0 (standard) | unsloth/gemma-4-31B-it-GGUF | gemma-4-31B-it-Q4_0.gguf |
| Unsloth Q4_K_M | unsloth/gemma-4-31B-it-GGUF | gemma-4-31B-it-Q4_K_M.gguf |
Q8_0 used as reference — well-established proxy for BF16 at this model size and quant level.
Methodology
- Top-1 accuracy — does the quantized model pick the same most-likely next token as Q8_0?
- Mean KLD — KL divergence of top-40 token distribution vs Q8_0, token by token
- Both metrics computed against the same fixed Q8_0 reference run for all models
- 3 runs per model confirmed zero variance (fully deterministic)
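For anyone who wants to check the two metrics, here is a minimal sketch of how they can be computed from per-position top-k logprobs. It is not the script used for this benchmark. The function names, the input layout (one `{token: logprob}` dict per sampled position), the `floor_logprob` given to tokens missing from the quantized model's list, and the lumped "everything else" bucket are all assumptions made for illustration. The post does not say how missing tokens were handled, and that choice can move the KLD noticeably.

```python
import math

def top1_agreement(ref_positions, test_positions):
    """Fraction of positions where both models rank the same token first."""
    matches = sum(
        max(ref, key=ref.get) == max(test, key=test.get)
        for ref, test in zip(ref_positions, test_positions)
    )
    return matches / len(ref_positions)

def position_kld(ref, test, floor_logprob=-20.0, eps=1e-12):
    """KL(ref || test) over the reference's top-k tokens plus one remainder bucket.

    Tokens the test model did not report get floor_logprob (an assumption).
    """
    p = [math.exp(lp) for lp in ref.values()]
    q = [math.exp(test.get(tok, floor_logprob)) for tok in ref]
    # Probability mass outside the reference's top-k, lumped into one bucket.
    p_rest = max(0.0, 1.0 - sum(p))
    q_rest = max(eps, 1.0 - sum(q))
    kld = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    if p_rest > 0.0:
        kld += p_rest * math.log(p_rest / q_rest)
    return kld

def mean_kld(ref_positions, test_positions, floor_logprob=-20.0):
    """Average the per-position KLD over all sampled positions."""
    total = sum(
        position_kld(ref, test, floor_logprob)
        for ref, test in zip(ref_positions, test_positions)
    )
    return total / len(ref_positions)

# Toy usage: one position, two candidate tokens.
reference = [{"the": math.log(0.6), "a": math.log(0.4)}]
quantized = [{"the": math.log(0.4), "a": math.log(0.6)}]
print(top1_agreement(reference, quantized), mean_kld(reference, quantized))
```

With a floor of -20, a single missing token that the reference rates at 50% already contributes about 10 nats at that position, so a handful of such positions can dominate the mean.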
Results — wikitext-2 (reproducible)
wikitext-2-raw-v1 test set, first 5,000 tokens, stride 5. Wikipedia-style prose only.
| Model | Top-1 acc | Mean KLD |
|---|---|---|
| Google QAT Q4_0 | 50.43% | 3.447 |
| Unsloth QAT UD-Q4_K_XL | 51.40% | 3.397 |
| Unsloth Q4_0 (standard) | 61.54% | 2.619 |
| Unsloth Q4_K_M | 66.06% | 2.304 |
Results — custom task categories (informational, not reproducible)
Hand-written test strings. Not a standard dataset — directional only.
From the benchmark output:
| Category | G-QAT acc | G-QAT KLD | U-QAT acc | U-QAT KLD | Q4_0 acc | Q4_0 KLD | Q4_K_M acc | Q4_K_M KLD |
|---|---|---|---|---|---|---|---|---|
| code | 92.31% | 0.460 | 92.31% | 0.458 | 97.44% | 0.049 | 94.87% | 0.025 |
| science | 55.56% | 1.218 | 55.56% | 1.293 | 80.56% | 0.300 | 77.78% | 0.396 |
| chat | 63.64% | 1.604 | 63.64% | 1.532 | 95.45% | 0.097 | 90.91% | 0.120 |
| tool_call | 77.78% | 1.036 | 70.37% | 1.105 | 92.59% | 0.299 | 96.30% | 0.250 |
| long_doc | 28.57% | 2.438 | 28.57% | 2.682 | 65.71% | 1.302 | 77.14% | 1.081 |
| overall | 52.56% | 3.101 | 53.17% | 3.071 | 65.44% | 2.263 | 69.43% | 1.993 |
The result that has me confused
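Standard Q4_0 beats Google's QAT Q4_0 by ~11 percentage points of top-1 accuracy on wikitext-2 (61.54% vs 50.43%) and by ~13 points in the overall row (65.44% vs 52.56%). And Q4_K_M beats both on wikitext-2 and overall.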
QAT is supposed to close the gap between Q4_0 and the reference by training the model to tolerate quantization noise. Google put real effort into this — they ran actual fine-tuning specifically for the Q4_0 format. Unsloth's UD-Q4_K_XL applies their Dynamic 2.0 method on top of the QAT checkpoint. By every account these should be better than a naively quantized Q4_0.
But they're not — at least not against a Q8_0 reference on wikitext-2 and these task categories.
My best guess: QAT Q4_0 is still flat uniform 4-bit quantization. The QAT process may reduce quantization error relative to naive Q4_0 — but Q4_K_M is a fundamentally different format that allocates more bits to sensitive layers. The K-quant format advantage might simply outweigh the QAT training benefit. But I'd expect someone who actually understands quantization internals to tell me if that reasoning is sound or completely wrong.
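To make "flat uniform 4-bit" concrete, here is a simplified sketch of blockwise 4-bit quantization. Q4_0 stores weights in blocks of 32 with one scale per block; the rounding rule below (symmetric, scale = absmax / 7) is a simplification I picked for illustration and is not llama.cpp's exact Q4_0 code. All function names are hypothetical.

```python
import math

BLOCK_SIZE = 32  # Q4_0 groups weights into blocks of 32 sharing one scale

def quantize_block_4bit(block):
    """Symmetric absmax quantization of one block to integers in [-8, 7].

    Simplified illustration, not llama.cpp's exact rounding rule.
    """
    absmax = max(abs(x) for x in block)
    scale = absmax / 7.0 if absmax > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q

def dequantize_block(scale, q):
    """Reconstruct approximate weights from the scale and 4-bit integers."""
    return [scale * v for v in q]

def roundtrip_rmse(weights):
    """RMS error after quantizing and dequantizing block by block."""
    sq_err = 0.0
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        scale, q = quantize_block_4bit(block)
        restored = dequantize_block(scale, q)
        sq_err += sum((x - y) ** 2 for x, y in zip(block, restored))
    return math.sqrt(sq_err / len(weights))

# One well-behaved block, and the same block with a single large outlier.
smooth = [0.01 * i for i in range(-16, 16)]
with_outlier = smooth[:-1] + [1.0]
print(roundtrip_rmse(smooth), roundtrip_rmse(with_outlier))
```

The outlier stretches the shared scale, so the other 31 weights in that block land on a much coarser grid. QAT trains the weights to sit better on this grid, while K-quants change the grid itself, which is the trade-off I'm guessing at above.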
What I'd like to know:
- Is comparing QAT Q4_0 against standard Q4_0 using Q8_0 as reference the right methodology, or does this introduce a systematic bias that favors Q4_K_M?
- Does the QAT training actually make Q4_0 better than naive Q4_0, just not better than K-quants — or is something else going on?
- Is there a flaw in the sliding-window logprob approach that would explain this?
What I do know: for my use case — dense factual prose, technical documents, long-form reasoning — the long_doc numbers tell the story. QAT Q4_0 drops to 28.57% top-1 vs Q8_0. Q4_K_M holds at 77%. Q8_0 stays.
The benchmark took ~2 hours of runtime for 4 models × 3 runs on this hardware.