Quick note on the QAT of recent
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
tldr: Google's quant is broken, use unsloth UD Q4_K_XL for now
This might be a low quality post, but oh well, we ball
llama-quantize will quant the token embeddings to Q6_K, when Google really was supposed to use "--pure", but that's only the first problem
The llama-quantize quant function is hardcoded to -7 when SOME groups are actually optimized for 8
The 32 block groups are misaligned, which causes them to intermingle, so they just need to be sorted and quantized separately
The unsloth Q4_K_XL name is misleading because it is actually pure Q4_0 (as it should be!)
The bf16/f16 scale they refer to is negligible but still necessary on the quest for perfection.
Working on a patch, but someone else might get one submitted sooner. It comes pretty much within margin of error; I assume unsloth just wants to keep their process hidden.
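To make the scale point above a bit more concrete, here is a minimal sketch of why the divisor matters for QAT weights. It is not the llama.cpp code and not the patch: quantize_block, block_error, the 8-value toy blocks (real blocks hold 32 values) and the step of 0.5 are all made up for illustration. The assumption is that QAT leaves each block sitting on a 4-bit grid, so re-quantizing is only lossless when the scale is derived from the same peak code the block was trained against.

```python
def quantize_block(values, divisor):
    """Quantize one block to signed 4-bit codes in [-8, 7] with one shared scale.

    `divisor` is the code the largest-magnitude value is mapped to
    (hypothetical parameter, e.g. -7 or -8).
    """
    peak = max(values, key=abs)          # largest magnitude, sign kept
    if peak == 0:
        return 0.0, [0] * len(values)
    scale = peak / divisor
    # Round to the nearest code, then clamp into the 4-bit range.
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def block_error(values, divisor):
    """Sum of squared errors after a quantize/dequantize round trip."""
    scale, codes = quantize_block(values, divisor)
    return sum((v - scale * c) ** 2 for v, c in zip(values, codes))


STEP = 0.5  # hypothetical grid step chosen so the arithmetic is exact

# Toy block whose peak sits on code -8, and one whose peak sits on code -7.
block_peak8 = [STEP * c for c in (-8, -3, 0, 2, 5, 7, 1, -1)]
block_peak7 = [STEP * c for c in (-7, -3, 0, 2, 5, 6, 1, -1)]

for name, block in (("peak at -8", block_peak8), ("peak at -7", block_peak7)):
    for divisor in (-8, -7):
        print(f"{name}, divisor {divisor}: error = {block_error(block, divisor):.6f}")
```

In this toy setup each block round-trips with zero error only under its own divisor, so a single hardcoded value cannot be right for both kinds of block. Whether real QAT checkpoints behave exactly like this is an assumption here, not something the sketch proves.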