r/LocalLLaMA · · 1 min read

Are these QAT quants better than non-QAT ones? Which should I use?


https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-i1-GGUF/tree/main

https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-GGUF/tree/main

I waited a bit before asking this. I have a 3060 12GB and 32GB of DDR3 RAM. I'm currently using an old version of unsloth's gemma-4-31B-it-UD-IQ3_XXS.gguf, which is 11.8GB. By overriding the ffn_down tensors, I could run 16k of bf16 context at about 1.3 tok/s the last time I used it. When I use the bf16 mmproj, I offload it to CPU. Overriding more tensors lets me go up to 32k context.
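For a rough sense of scale, the file size above can be converted into an average bits-per-weight figure. This is only a back-of-the-envelope sketch: it assumes the model has exactly 31e9 parameters (taken from the "31B" in the name), that "GB" means 10^9 bytes, and it ignores GGUF metadata and the fact that different tensors use different quant types. The function names and the 2.5 bits-per-weight target are made up for illustration, not taken from any real quant.

```python
def bits_per_weight(file_size_gb: float, n_params_billion: float) -> float:
    """Average bits stored per parameter, assuming GB = 1e9 bytes."""
    return file_size_gb * 1e9 * 8 / (n_params_billion * 1e9)


def size_gb(bpw: float, n_params_billion: float) -> float:
    """Approximate file size in GB for a given average bits per weight."""
    return bpw * n_params_billion * 1e9 / 8 / 1e9


# The 11.8GB IQ3_XXS file mentioned above, with an assumed 31e9 parameters.
current_bpw = bits_per_weight(11.8, 31.0)
print(f"current quant: about {current_bpw:.2f} bits per weight")

# Hypothetical lower target, only to show how the size scales.
print(f"at 2.5 bpw: about {size_gb(2.5, 31.0):.1f} GB")
```

Whatever the real numbers turn out to be, the difference between the file size and 12GB is all that is left on the GPU for context, the mmproj and anything MTP needs.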

I saw that there are even Q2-Q3 quants of the new QAT Gemma 31B in the two links above. Are these better than the model I have right now because they are QAT? Which quant should I get?

How low can I go? I want to use MTP if possible, and I need advice on which model to get for that too, since I saw the assistant models also have quants. Or would MTP ultimately just slow me down if it means the context has to be offloaded to CPU to make room?

I heard the i-quants are slower on CPU, so should I use the Q2_K in the second link? Or should I use one of the smaller quants in the first link, if that makes it possible to keep MTP and the context on the GPU?

submitted by /u/ThrowawayProgress99