moar QAT stuff and hairy ticks
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
tldr; finally got to a point where we can publish some of the ggufs with a more accurate process.
in these repos: https://huggingface.co/idkwhattoputherenow/gemma-4-12B-it-qat-q4_0-maxerr https://huggingface.co/idkwhattoputherenow/gemma-4-31B-it-qat-q4_0-maxerr
this is a followup to my og post: https://www.reddit.com/r/LocalLLaMA/comments/1u0marm/quick_note_on_the_qat_of_recent/
i still don't know what package google did their qat with (or maybe i'm just too out of the loop and/or loopy to find out), but they probably did a version of Q4_0 with BF16 scales instead of F16. so my patch starts with two seeds that tend to be typical for q4 (to determine symmetric/asymmetric), does a full round trip to f16, and computes the error. it takes whichever seed has the lower maxerr and searches around it until the error starts to get worse. it actually ended up working better than using an imatrix and weighting the errors, but i'm still testing that road. at least with these settings, it does end up with a similar kld to unsloth (UD-Q4_K_XL-super-mega-heccin-proprietary.gguf), which was the goal.
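for anyone who wants to poke at the idea without the patch, here is a rough NumPy sketch of a max-error scale search for one Q4_0 block. this is not the actual patch. assumptions: the two seeds are the stock llama.cpp scale (signed max / -8) and a symmetric one (amax / 7), the search walks the f16 scale one ULP at a time, and `quantize_block_maxerr`, `step_f16`, and `max_steps` are made-up names for illustration.

```python
import numpy as np

QK = 32  # Q4_0 block size in llama.cpp


def roundtrip(x, d):
    """Quantize block x to 4-bit codes with f16 scale d, then dequantize."""
    d32 = np.float32(d)
    if d32 == 0 or not np.isfinite(d32):
        return None, None
    # Codes are 0..15 and decode as (code - 8) * d.
    q = np.clip(np.floor(x / d32 + 8.5), 0, 15).astype(np.uint8)
    y = (q.astype(np.float32) - 8.0) * d32
    return q, y


def max_err(x, d):
    """Largest absolute round-trip error for scale d (inf if d is unusable)."""
    _, y = roundtrip(x, d)
    return np.inf if y is None else float(np.max(np.abs(x - y)))


def step_f16(d, k):
    """Move an f16 value k ULPs away from zero (k > 0) or toward it (k < 0)."""
    bits = np.array([d], dtype=np.float16).view(np.uint16).astype(np.int32)
    return (bits + k).astype(np.uint16).view(np.float16)[0]


def quantize_block_maxerr(x, max_steps=64):
    """Return (f16 scale, uint8 codes) chosen to minimize max round-trip error."""
    x = np.asarray(x, dtype=np.float32)
    assert x.shape == (QK,)
    vmax = x[np.argmax(np.abs(x))]  # signed value with the largest magnitude
    if vmax == 0:
        return np.float16(0), np.full(QK, 8, dtype=np.uint8)

    # Assumed seeds: stock Q4_0 scale (uses code 0 for vmax) and a symmetric one.
    seeds = [np.float16(vmax / -8.0), np.float16(abs(vmax) / 7.0)]
    start = min(seeds, key=lambda d: max_err(x, d))
    best_d, best_e = start, max_err(x, start)

    # Walk outward from the better seed in both directions, stop when it gets worse.
    for direction in (+1, -1):
        d, prev = start, max_err(x, start)
        for _ in range(max_steps):
            d = step_f16(d, direction)
            e = max_err(x, d)
            if e > prev:
                break
            if e < best_e:
                best_d, best_e = d, e
            prev = e

    q, _ = roundtrip(x, best_d)
    return best_d, q
```

the scale is always an f16 value when the error is measured, so the number being minimized is what the stored block would actually decode to. by construction the result is never worse (in max error) than the stock seed.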
If anyone (not associated with unsloth) wants to pick up the pytorch and finish it for a pr, lmk and you can have the source with no limits or attribution required or wanted. maybe someone smarter than me can get it into a usable state without needing a whole separate fork and spoon. kinda curious if kimi does the same or if it is static.
worked surprisingly well with heretic but tbh, if you just do a normal quant to q4_0 with the --pure flag you get 90% of the way there, which google coulda/shoulda just done originally with their gguf release. did these from f32 cuz bf16 math isn't precise enough when the differences get small enough.
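a tiny illustration of the bf16 precision point, as a sketch only: NumPy has no bf16 type, so `to_bf16` below is a hypothetical helper that emulates it by rounding the float32 bit pattern (ties to even, NaN/inf not handled). it shows two nearby values collapsing to the same bf16 number while f16 still tells them apart.

```python
import numpy as np


def to_bf16(x):
    """Round float32 values to the nearest bf16, returned as float32."""
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)  # for ties-to-even
    rounded = bits + np.uint32(0x7FFF) + lsb
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)


# Two values that differ by about 0.16 percent.
vals = np.array([0.03125, 0.0313], dtype=np.float32)
print("bf16:", to_bf16(vals))
print("f16: ", vals.astype(np.float16))
```

bf16 keeps 8 significant bits versus 11 for f16, so differences below roughly 0.4% of the value can disappear once anything passes through bf16.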
this process works with all of the g4qat models but the gap grows wider on the larger ones, presumably because they accumulate errors. think most ppl use the 31B so that's what I'm uploading even tho E4B is my preference because it works so well on old 4gb cards with vulkan.
If llmfan46 reads this, feel free to just put the quant up in your repo if you want and i'll take mine down, or tell me i'm just off my rocker xd
31B                                      Mean KLD  Same Top%  RMS Δp%   95% KLD
---------------------------------------  --------  ---------  -------  --------
heretic maxerr Q4_0 vs heretic F32       0.032453    93.954%   3.603%  0.110820
heretic stock Q4_0 (HF) vs heretic F32   0.100584    87.443%   5.985%  0.358515
heretic F32 vs original F32              0.073323    90.768%   5.449%  0.303116
heretic maxerr Q4_0 vs original F32      0.075877    90.649%   5.484%  0.312716
heretic stock Q4_0 (HF) vs original F32  0.133828    85.606%   7.095%  0.508320
original maxerr Q4_0 vs original F32     0.014023    96.610%   2.472%  0.032672
unsloth Q4_K_XL vs original F32          0.013952    96.649%   2.493%  0.034219
google Q4_0 vs original F32              0.093905    88.010%   5.783%  0.325671

12B                                      Mean KLD  Same Top%  RMS Δp%   95% KLD
---------------------------------------  --------  ---------  -------  --------
heretic maxerr Q4_0 vs heretic F32       0.146459    86.884%   7.438%  0.508752
heretic stock Q4_0 (HF) vs heretic F32   0.378834    77.502%  11.690%  1.420622
heretic F32 vs original F32              0.175292    82.815%   8.586%  0.612369
heretic maxerr Q4_0 vs original F32      0.235670    81.166%   9.652%  0.833584
heretic stock Q4_0 (HF) vs original F32  0.457296    74.188%  13.034%  1.704541
original maxerr Q4_0 vs original F32     0.129771    88.703%   7.047%  0.469628
unsloth Q4_K_XL vs original F32          0.136016    88.485%   7.162%  0.503085
google Q4_0 vs original F32              0.510035    73.775%  13.624%  1.944944
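for reference, a rough sketch of how numbers like the columns above can be computed from two sets of logits. this is not the tool used for the tables. assumptions: "95% KLD" is the 95th percentile of per-token KLD, Δp is the change in probability of the actual next token, and `compare_logits` plus its argument names are made up for illustration.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def compare_logits(base_logits, quant_logits, next_tokens):
    """base_logits, quant_logits: (n_tokens, vocab). next_tokens: (n_tokens,) ids."""
    lp = log_softmax(np.asarray(base_logits, dtype=np.float64))
    lq = log_softmax(np.asarray(quant_logits, dtype=np.float64))
    p = np.exp(lp)

    # Per-token KL divergence of the quantized model from the reference.
    kld = (p * (lp - lq)).sum(axis=-1)

    # Does the most likely token match between the two models?
    same_top = lp.argmax(axis=-1) == lq.argmax(axis=-1)

    # Change in probability assigned to the token that actually came next.
    idx = np.arange(len(next_tokens))
    dp = np.exp(lq[idx, next_tokens]) - p[idx, next_tokens]

    return {
        "mean_kld": float(kld.mean()),
        "same_top_pct": float(100.0 * same_top.mean()),
        "rms_dp_pct": float(100.0 * np.sqrt((dp ** 2).mean())),
        "kld_95": float(np.quantile(kld, 0.95)),
    }
```

lower is better for the KLD and RMS Δp columns, higher is better for Same Top%.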