What's your experience with Gemma4 QAT?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hey everyone!
Not a native speaker, so please correct my English where I make mistakes (I can only learn from it!).
While it's only been out for a short while, I wanted to post about it because it's been such a joy.
To say it upfront: I use Qwen3.6 27B for programming and Gemma4 for basically everything else, so I can't say anything meaningful about programming.
Previously I used Gemma4-31B Q4_K_L (for long-context tasks, 128k with Q8_0 cache) and Q6_K_L (for short-context tasks, 32k with Q8_0 cache). By short-context tasks I mean quick translations, roleplaying, short but accurate OCR, etc. By long-context tasks I mean long-document parsing, web-search research, etc.
With the QAT model, I've been able to use the same model for both kinds of task (nice!) and I've noticed subtle quality improvements.
In roleplay, for example, it has much more varied word use, makes more context-relevant remarks, understands correlations better and is able to use them, etc.
Sadly I have no experience with the Q8_0 model, but from what I can tell the QAT model performs better than Q6_K_L from bartowski at the very least. It is, however, still severely hampered by cache quantization: Q8_0 cache shows a noticeable degradation for me at 128K.
Using MTP with Gemma 31B QAT has been amazing too! I get 50 t/s tg (as opposed to 21 t/s) when summarizing a 32k-token Wikipedia page and ~36 t/s tg during roleplay (as opposed to 20 t/s), and you can likely get higher numbers on Linux (I'm stuck with Windows for now...).
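For anyone who wants those numbers as speedup factors, here is the arithmetic as a tiny Python sketch. The function name `speedup` is just something made up for illustration; the t/s values are the ones quoted above.

```python
def speedup(with_mtp: float, without_mtp: float) -> float:
    """Return how many times faster generation is with MTP enabled."""
    if without_mtp <= 0:
        raise ValueError("baseline tokens/s must be positive")
    return with_mtp / without_mtp


# Token generation speeds (t/s) quoted in the post.
summarization = speedup(50, 21)
roleplay = speedup(36, 20)

print(f"summarization: {summarization:.2f}x")  # prints "summarization: 2.38x"
print(f"roleplay: {roleplay:.2f}x")            # prints "roleplay: 1.80x"
```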
I had to dial it in though: 5 max drafts worked well for me, but for my friends 4 or 6 worked better. Try the values 3-7 in 5 separate runs of the same task and see which one runs best for you.
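If you want to keep track of that comparison, a minimal Python sketch could look like the one below. It assumes one run per max-draft value and that you note down the t/s of each run by hand. The helper name `best_max_draft` is hypothetical, and the numbers in `measured` are placeholders, not real measurements.

```python
def best_max_draft(tokens_per_second: dict[int, float]) -> int:
    """Return the max-draft value whose run had the highest tokens/s."""
    if not tokens_per_second:
        raise ValueError("no measurements given")
    return max(tokens_per_second, key=tokens_per_second.get)


# Placeholder values only: replace with the t/s you measured for each
# max-draft setting (3 through 7) on the same task.
measured = {3: 41.9, 4: 46.2, 5: 50.1, 6: 48.0, 7: 44.1}

best = best_max_draft(measured)
print(f"best max drafts: {best} ({measured[best]} t/s)")
```

A single run per setting can be noisy, so if two values end up close together it may be worth repeating those two a few times before settling on one.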
So yeah, enough about my experiences! How were yours? Did you notice any improvement or degradation when using the QAT models? And what is programming like on them?