r/LocalLLaMA · · 1 min read

Quick note on the QAT of recent


tl;dr: Google's quant is broken; use unsloth UD Q4_K_XL for now.

This might be a low-quality post, but oh well, we ball.

llama-quantize will quantize the token embedding to Q6_K, when Google really was supposed to use "--pure", but that's only the first problem.
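For anyone who wants to try the "--pure" route themselves, here is a rough sketch of how the call could be assembled from Python. The helper name and the file names are made up for illustration. It assumes a llama-quantize binary on your PATH and the usual argument order (options, then input, output, type), so check `llama-quantize --help` on your build before relying on it.

```python
import shlex


def build_quantize_command(src, dst, quant_type="Q4_0", pure=True,
                           binary="llama-quantize"):
    """Build the argument list for a llama-quantize run.

    Hypothetical helper: it only assembles the arguments and never
    runs anything, so it is safe to call without the binary installed.
    """
    cmd = [binary]
    if pure:
        # --pure asks for every tensor in the same type, instead of the
        # default mixture that gives some tensors a different type.
        cmd.append("--pure")
    cmd += [src, dst, quant_type]
    return cmd


# File names below are placeholders, not real release artifacts.
cmd = build_quantize_command("model-bf16.gguf", "model-q4_0.gguf")
print(shlex.join(cmd))
```

To actually run it, hand the list to `subprocess.run(cmd, check=True)`.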

The llama-quantize quant function is hardcoded to -7, when SOME groups are actually optimized for 8.

The 32 block groups are misaligned, which causes them to intermingle, so they just need to be sorted and quantized separately.
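To see why the divisor matters, here is a toy round trip in plain Python. This is NOT llama.cpp's code, just a minimal sketch of symmetric 4-bit block quantization; the `levels` parameter is a stand-in I made up for the hardcoded constant. The block is built to sit exactly on a grid where the largest magnitude maps to 8, then requantized with 8 and with 7.

```python
def quantize_block(values, levels):
    """Toy symmetric 4-bit quantizer for one block (illustrative only).

    The element with the largest magnitude is mapped to -levels, and
    every code is clamped to the signed 4-bit range [-8, 7].
    """
    signed_max = max(values, key=abs)
    if signed_max == 0:
        return 0.0, [0] * len(values)
    scale = signed_max / -levels
    # Python's round() rounds exact halves to the nearest even integer.
    codes = [min(7, max(-8, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


def max_error(values, levels):
    """Largest absolute round-trip error for one block."""
    scale, codes = quantize_block(values, levels)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(values, restored))


# A 32-element block that lies exactly on an "8" grid with scale 0.5.
grid_codes = [-8, -3, 0, 2, 5, 7, 1, -1] * 4
weights = [0.5 * c for c in grid_codes]

print("divisor 8:", max_error(weights, 8))
print("divisor 7:", max_error(weights, 7))
```

With the matching divisor the block comes back bit-exact. With the mismatched one, every value that doesn't happen to land on the new grid picks up rounding error, which is the kind of loss a QAT checkpoint shouldn't have to eat.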

The unsloth Q4_K_XL name is misleading, because it is actually pure Q4_0 (as it should be!).

The bf16/f16 scale they refer to is negligible but still necessary on the quest for perfection.

I'm working on a patch, but someone else might get one submitted sooner. It comes out pretty much within margin of error; I assume unsloth just wants to keep their process hidden.

submitted by /u/dreamkast06