Gemma 4 26B A4B IT QAT Comparison
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hopefully this isn't too low effort of a post. I just finished the benchmarks and I figured I'd post them online because they certainly were insightful for me. I did not use any AI other than asking Gemini 3.1 Pro if it was statistically significant because I was too tired to do inferential statistics.
Methodology:
I used oMLX to run Gemma 4 26B A4B IT from mlx-community, with the following models:
Gemma 26B 4 Bit: https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-4bit
Gemma 26B 6 Bit: https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-6bit
Gemma 26B QAT 8 Bit: https://huggingface.co/mlx-community/gemma-4-26B-A4B-it-qat-8bit
I ran them on a 64GB M5 Pro MacBook with oMLX version 0.4.1, an unquantized KV cache, and thinking enabled.
I ran the following tests on all models: 50 MMLU_PRO questions, and 100 HumanEval questions.
The only difference in the chat templates between the models above relates to multimodal tool calls, so it should not have affected the results. Additionally, they were all quantized using the same method, so apart from bit width, the only variable should be the original model weights.
I chose the 8 bit QAT to avoid confounding variables from any mlx specific quantization damage. My goal was to test the QAT model in a form as close as possible to its original weights. This model should be virtually identical to the unsloth q4_k_xl quant of the QAT model. (I mean legitimately very close to identical, not "TQ4 is basically BF16 identical")
I chose to compare it to an mlx 4 bit and 6 bit quant, as both fall in the bpw range where users have expressed uncertainty about replacing their old quant with a new QAT model.
Results:
| Model | Benchmark | Percentage (Correct/Total) |
|---|---|---|
| Gemma 4 26B IT 4 Bit | MMLU_PRO | 56.0% (28/50) |
| Gemma 4 26B IT 4 Bit | HUMANEVAL | 90.0% (90/100) |
| Gemma 4 26B IT 6 Bit | MMLU_PRO | 58.0% (29/50) |
| Gemma 4 26B IT 6 Bit | HUMANEVAL | 98.0% (98/100) |
| Gemma 4 26B IT QAT 8 Bit | MMLU_PRO | 52.0% (26/50) |
| Gemma 4 26B IT QAT 8 Bit | HUMANEVAL | 90.0% (90/100) |
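To give a sense of how much uncertainty sits behind each score, here is a minimal sketch that computes 95% Wilson score intervals from the counts in the table. It was not part of the original run; the helper name `wilson_interval` and the choice of the Wilson method are assumptions made for illustration.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Return the (low, high) Wilson score interval for a proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Counts copied from the results table.
results = {
    ("4 Bit", "MMLU_PRO"): (28, 50),
    ("4 Bit", "HUMANEVAL"): (90, 100),
    ("6 Bit", "MMLU_PRO"): (29, 50),
    ("6 Bit", "HUMANEVAL"): (98, 100),
    ("QAT 8 Bit", "MMLU_PRO"): (26, 50),
    ("QAT 8 Bit", "HUMANEVAL"): (90, 100),
}

for (model, bench), (correct, total) in results.items():
    low, high = wilson_interval(correct, total)
    print(f"{model:10s} {bench:10s} {correct}/{total}  95% CI: {low:.1%} to {high:.1%}")
```

With only 50 questions, each MMLU_PRO interval spans more than 20 percentage points, which is why small gaps between models on that benchmark are hard to read into.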
Interpretation:
Both chi-squared tests and z tests were performed by Gemini.
The only statistically convincing difference across these benchmarks is on HUMANEVAL, where the 6 Bit model (98/100) beats the QAT 8 Bit model (90/100). The regular 4 Bit model also scored 90/100, so the same gap applies to it. The differences on MMLU_PRO are not statistically significant and are consistent with random chance given the smaller sample size (50 questions).
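For anyone who wants to check the significance claim by hand, here is a minimal sketch of a two-sided, pooled two-proportion z test using only the Python standard library. The exact calculation Gemini ran is not shown in the post, so this is an assumed reconstruction, and `two_proportion_z_test` is a hypothetical helper name. It treats the two runs as independent samples; a paired test would need per-question results, which the post does not include.

```python
import math

def two_proportion_z_test(correct_a, total_a, correct_b, total_b):
    """Two-sided z test for a difference in proportions (pooled standard error)."""
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts copied from the results table.
comparisons = {
    "HUMANEVAL: 6 Bit vs QAT 8 Bit": (98, 100, 90, 100),
    "MMLU_PRO: 6 Bit vs QAT 8 Bit": (29, 50, 26, 50),
    "MMLU_PRO: 4 Bit vs QAT 8 Bit": (28, 50, 26, 50),
}

for name, counts in comparisons.items():
    z, p_value = two_proportion_z_test(*counts)
    print(f"{name}: z = {z:.2f}, p = {p_value:.3f}")
```

Under these assumptions the HUMANEVAL comparison falls below the usual 0.05 threshold, while both MMLU_PRO comparisons stay well above it.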
My conclusion is that the QAT model is worse than a Q6 quant of the original model. This means that claims like "QAT is indistinguishable from BF16" or "the distributions are very close" are likely wrong. The full-precision QAT model is unlikely to beat the 8 bit QAT model I tested, while the full-precision non-QAT model is very likely to beat the q6 model, so the true gap between the two unquantized models is likely wider than the one I measured.
QAT was not clearly better or worse than a regular MLX q4 quant. Now, for GGUF, QAT likely still smashes Q4_0 out of the park and might even be competitive with IQ4_XS, but it seems that the assumption that q4_k, q5, and even q6 quants should be replaced with QAT quants is a bit early.
I might run more tests on the 26B, or even test out the 31B model later, as the sample sizes that I have are just enough to begin to get an idea.
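As a rough guide to how many questions a follow-up run would need, here is a sketch of the standard sample-size approximation for comparing two proportions. The 5% significance level, the 80% power target, and the helper name `questions_needed` are assumptions for illustration, and the formula again treats the two runs as independent samples.

```python
import math
from statistics import NormalDist

def questions_needed(p_a, p_b, alpha=0.05, power=0.8):
    """Approximate questions per model needed to detect p_a vs p_b (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))
    ) ** 2
    return math.ceil(numerator / (p_a - p_b) ** 2)

# Observed accuracies from the results table, used here as guesses at the true rates.
print("MMLU_PRO, 58% vs 52%:", questions_needed(0.58, 0.52))
print("HUMANEVAL, 98% vs 90%:", questions_needed(0.98, 0.90))
```

If the true MMLU_PRO gap is about the size observed here, reliably detecting it would take on the order of a thousand questions per model rather than 50.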
Creative writing may be different, but I mainly wanted to measure similarity with the original model, and worse benchmark performance is by definition indicative of dissimilarity.
Also this is a MoE, and so maybe the QAT works better on the 31B.
TL;DR: Gemma 4 26B QAT (tested at 8 bit as a stand-in for the unquantized QAT weights) appears inferior to unquantized Gemma 4, so it might not make sense to replace 5, 6, or even dynamic 4 bit quants with Gemma 4 26B QAT. These observations may not generalize to the 31B, 12B, or E2/4B.