In Q8_0 weight quantization, why can't we just skip blocks of 32 that have very large outliers?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Looking for someone with an expert-level understanding.
I understand that we can skip layers and sub-layers when quantizing, but why can't we skip blocks? I'm using Q8_0 because it's a simple example: every block of 32 values has one scale. If at least one of the 32 values in a block meets the outlier criterion, don't quantize that block. Leave it at its native precision, since the math is all done at native precision anyway.
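To make the idea concrete, here is a rough sketch of what one Q8_0 block does and why a single outlier hurts the other 31 values. The scale is the block's largest magnitude divided by 127, stored as fp16, and each value becomes an int8. The `has_outlier` criterion and its `ratio` threshold are my own hypothetical stand-ins, not anything defined by GGUF or llama.cpp.

```python
import numpy as np

BLOCK = 32  # Q8_0 groups weights into blocks of 32 values


def q8_0_roundtrip(block):
    """Quantize one block Q8_0-style, then dequantize it again."""
    block = np.asarray(block, dtype=np.float32)
    amax = float(np.max(np.abs(block)))
    d = amax / 127.0                       # one scale for the whole block
    inv = 1.0 / d if d > 0 else 0.0
    q = np.round(block * inv).astype(np.int8)
    d16 = np.float32(np.float16(d))        # the scale is stored as fp16
    return q.astype(np.float32) * d16


def has_outlier(block, ratio=8.0):
    """Hypothetical criterion: max magnitude > ratio * median magnitude."""
    mags = np.abs(np.asarray(block, dtype=np.float32))
    return bool(mags.max() > ratio * np.median(mags))


def rms_error(a, b):
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(np.sqrt(np.mean((a - b) ** 2)))


# A well-behaved block, and the same block with one large value planted in it.
base = (0.02 * np.sin(np.arange(BLOCK))).astype(np.float32)
spiked = base.copy()
spiked[0] = 2.0

# Compare the error on the 31 ordinary values only.
err_base = rms_error(base[1:], q8_0_roundtrip(base)[1:])
err_spiked = rms_error(spiked[1:], q8_0_roundtrip(spiked)[1:])
print("outlier flagged:", has_outlier(base), has_outlier(spiked))
print("RMS error on ordinary values:", err_base, err_spiked)
```

The outlier stretches the scale, so the small values in the same block land on a much coarser grid. That is the error the skip-the-block idea is trying to avoid.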
When I look at the quantized sub-layers of a GGUF model in Q8_0, it seems that this method would have a significant effect on the final accuracy, while fewer than 1% of the blocks in each sub-layer would need to be skipped.
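A minimal sketch of how one might measure that fraction and what it would cost in size, assuming the skipped blocks are kept as fp16 and using the same hypothetical max-versus-median criterion. A Q8_0 block takes 34 bytes (a 2-byte fp16 scale plus 32 int8 values); the mixed layout below is purely illustrative, is not an existing GGUF tensor type as far as I know, and ignores whatever per-block flag would be needed to mark skipped blocks.

```python
import numpy as np

BLOCK = 32
Q8_0_BYTES = 2 + BLOCK   # fp16 scale + 32 int8 values = 34 bytes
F16_BYTES = 2 * BLOCK    # assumption: a skipped block is stored as fp16 = 64 bytes


def flagged_fraction(weights, ratio=8.0):
    """Fraction of 32-value blocks whose max magnitude exceeds
    ratio * median magnitude (hypothetical outlier criterion)."""
    w = np.asarray(weights, dtype=np.float32).reshape(-1, BLOCK)
    mags = np.abs(w)
    flagged = mags.max(axis=1) > ratio * np.median(mags, axis=1)
    return float(flagged.mean())


def mixed_bytes_per_block(fraction_skipped):
    """Average bytes per block if skipped blocks stay fp16 and the rest use Q8_0."""
    return (1.0 - fraction_skipped) * Q8_0_BYTES + fraction_skipped * F16_BYTES


# Synthetic stand-in for a sub-layer: 100 ordinary blocks, one with a planted outlier.
w = np.tile(0.02 * np.sin(np.arange(BLOCK)), (100, 1)).astype(np.float32)
w[3, 0] = 2.0

frac = flagged_fraction(w.ravel())
print("fraction of blocks flagged:", frac)
print("average bytes per block:", mixed_bytes_per_block(frac))
```

On a real model the weights would come from the unquantized tensor rather than synthetic data, and the threshold would need tuning.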