Anyone seen benchmarks comparing Gemma 4 4-bit QAT vs. 8-bit standard quants?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I'm trying to find out whether anyone has benchmarked the Gemma 4 4-bit QAT models (via Unsloth) against standard 8-bit non-QAT quants.
I know QAT is supposed to retain a ton of accuracy relative to the BF16 baseline, but I'm curious how a 4-bit QAT model actually fares against a traditional 8-bit PTQ (post-training quantization) quant. I've read mixed feedback across different threads, but I haven't been able to find hard numbers or a direct head-to-head comparison between the two.
Has anyone run any evaluations on this yet?