r/LocalLLaMA · · 1 min read

Have there been any recent developments on which quant is considered optimal?


I recall that in the earlier days, q4 was said to be the optimal quantization level.

That is to say, if you have three options:

a small q8 model
a medium q4 model
a large q2 model

and they all use the same amount of GPU VRAM, the medium q4 model would be the best-performing of the three.
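To make the "same amount of VRAM" comparison concrete, here is a rough sketch of the arithmetic. It counts weights only and assumes nominal 8, 4, and 2 bits per weight; real quant formats store extra scale data and so use somewhat more, and the KV cache and activations are ignored. The 24 GB budget and the function name are hypothetical, chosen only for illustration.

```python
def params_that_fit(vram_gb: float, bits_per_weight: float) -> float:
    """Approximate parameter count (in billions) whose weights fit in vram_gb.

    Assumes 1 GB = 1e9 bytes and counts weights only (no KV cache,
    activations, or per-block quantization overhead).
    """
    total_bits = vram_gb * 1e9 * 8
    return total_bits / bits_per_weight / 1e9


BUDGET_GB = 24.0  # hypothetical VRAM budget

for label, bits in [("small q8", 8), ("medium q4", 4), ("large q2", 2)]:
    print(f"{label}: ~{params_that_fit(BUDGET_GB, bits):.0f}B parameters")
```

Under these assumptions, each halving of the bit width doubles the number of parameters that fit in the same budget, so the question is whether the extra parameters outweigh the precision lost per weight.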

I also know that Apple (crazy that I am citing Apple here, given how secretive they tend to be) was quite public about using q4 quantized models for their on-device models.

submitted by /u/takuonline