Has there been any recent development on which quant is considered optimal?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I recall in earlier days, q4 was said to be optimal.
That is to say, if you have a:
small q8 model
medium q4 model
large q2 model
Assuming they use the same amount of GPU VRAM, medium q4 would be the best-performing model.
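To make the "same amount of VRAM" comparison concrete, here is a rough sketch of how many parameters fit in a fixed budget at each quant level. It assumes weights take exactly the nominal bits per weight and ignores the KV cache, activations, and runtime overhead; real quant formats typically use somewhat more than their nominal bit width. The 16 GB budget and the function name are made up for illustration.

```python
def max_params_billion(vram_gb: float, bits_per_weight: float) -> float:
    """Largest parameter count (in billions) whose weights alone fit in vram_gb.

    Simplifying assumption: every weight costs exactly bits_per_weight bits,
    and nothing else (KV cache, activations, overhead) uses memory.
    """
    # vram_gb * 1e9 bytes * 8 bits per byte / bits_per_weight, expressed in billions
    return vram_gb * 8 / bits_per_weight


budget_gb = 16.0  # hypothetical VRAM budget, weights only

for name, bits in [("q8", 8), ("q4", 4), ("q2", 2)]:
    size = max_params_billion(budget_gb, bits)
    print(f"{name}: ~{size:.0f}B parameters fit in {budget_gb:.0f} GB")
```

Under these assumptions, halving the bits per weight doubles the parameter count that fits, which is where the small q8 / medium q4 / large q2 trade-off comes from. The sketch says nothing about which of the three gives the best output quality.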
I also know that Apple (crazy that I am citing Apple here, given how secretive they tend to be) was quite public about using q4 quant models for their on-device models.