Are these QAT quants better than non-QAT ones? Which should I use?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-i1-GGUF/tree/main
https://huggingface.co/mradermacher/gemma-4-31B-it-qat-q4_0-unquantized-GGUF/tree/main
I waited a bit before asking this. I have an RTX 3060 12GB and 32GB of DDR3 RAM. I'm currently using an old version of unsloth's gemma-4-31B-it-UD-IQ3_XXS.gguf, which is 11.8GB. By overriding the ffn_down tensors, I could run 16k of bf16 context at about 1.3 tk/s the last time I used it. When I use the bf16 mmproj, I offload it to CPU. Overriding more tensors lets me go up to 32k context.
I saw that there are even Q2-Q3 quants of the new QAT Gemma 31B in the two links above. Are these better than the model I have right now due to them being QAT? What quant should I get?
How low can I go? I want to use MTP if possible, and I need advice on which model to get for that too, since I saw that the assistant models also have quants. Or would MTP ultimately just slow me down if it forces the context to be offloaded to CPU to make room?
I heard the i-quants are slower on CPU, so should I use the Q2_K in the second link? Or should I use one of the smaller quants in the first link, if that makes it possible to keep both MTP and the context on the GPU?
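For reference, this is roughly the budget arithmetic behind the question, as a minimal sketch. The layer count, KV head count, and head dimension below are placeholders, not the real Gemma 31B config, and the formula assumes plain full attention on every layer, so treat the result as a rough upper bound rather than what a runtime would actually allocate. The function names are made up for illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem):
    """Upper-bound KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


def fits_in_vram(model_bytes, kv_bytes, overhead_bytes, vram_bytes):
    """True if weights + KV cache + scratch overhead fit entirely on the GPU."""
    return model_bytes + kv_bytes + overhead_bytes <= vram_bytes


# Placeholder architecture numbers (assumptions, NOT the real model config).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 60, 8, 128

kv = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx_len=16 * 1024, bytes_per_elem=2)  # bf16
model = int(11.8e9)   # size of the current IQ3_XXS file
vram = int(12e9)      # RTX 3060 12GB
overhead = int(0.5e9) # guessed compute buffers

print(f"KV cache: {kv / 2**30:.2f} GiB")
print("fits on GPU:", fits_in_vram(model, kv, overhead, vram))
```

With numbers anywhere near these, the weights alone nearly fill the card, which is why I end up overriding tensors to CPU, and why I'm asking whether a smaller quant would leave room for context and an MTP assistant model.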