Quality evaluation of quants with limited time or tokens
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
About a year ago, people were publishing a lot of benchmarks comparing various quants of models. I understand that this is not really feasible with the current (and otherwise welcome) pace of new model releases, but on the other hand, it would still be useful to be able to check locally whether q3 of this model is better than q6 of that model.
I've checked a few benchmarks, but they seem to be broad in scope, and the models may generate millions of tokens while running them. With a 300B+ MoE model on a home setup at 10-20 t/s, that does not seem feasible (a single million tokens is already roughly 14 to 28 hours of generation). I'd rather have a benchmark where I could limit the focus to the tasks that provide the most predictive power (e.g. tasks that may pass on q6 but fail on q5).
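A minimal sketch of what "focus on the most predictive tasks" could look like, assuming per-task pass rates for a few quants are already available (from a one-off full run, or from someone else's published results). Everything here is hypothetical: the task names, the pass rates, the token costs, and the function `pick_discriminative_tasks` are made up for illustration. The heuristic is simply the variance of the pass rate across quants divided by token cost, chosen greedily under a token budget. It is not an established benchmark-reduction method.

```python
from statistics import pvariance


def pick_discriminative_tasks(results, token_cost, token_budget):
    """Greedily pick tasks whose outcome differs most between quants.

    results:      {task_id: {quant_name: pass rate in [0, 1]}}
    token_cost:   {task_id: approximate tokens generated per run}
    token_budget: maximum total tokens to spend on one evaluation pass
    """
    scored = []
    for task, by_quant in results.items():
        # Spread is 0.0 when every quant gets the same result on this task,
        # so such a task cannot tell the quants apart and is skipped.
        spread = pvariance(list(by_quant.values()))
        if spread > 0:
            scored.append((spread / token_cost[task], task))

    # Highest "spread per token" first.
    scored.sort(reverse=True)

    chosen, used = [], 0
    for _, task in scored:
        if used + token_cost[task] <= token_budget:
            chosen.append(task)
            used += token_cost[task]
    return chosen


# Hypothetical data, for illustration only.
example_results = {
    "easy_math":     {"q8": 1.0, "q6": 1.0, "q5": 1.0},
    "long_refactor": {"q8": 1.0, "q6": 1.0, "q5": 0.0},
    "trivia":        {"q8": 0.0, "q6": 0.0, "q5": 0.0},
    "json_schema":   {"q8": 1.0, "q6": 0.5, "q5": 0.0},
}
example_cost = {
    "easy_math": 500,
    "long_refactor": 8000,
    "trivia": 200,
    "json_schema": 1500,
}

print(pick_discriminative_tasks(example_results, example_cost, 5000))
# → ['json_schema']
```

Tasks that every quant passes (or every quant fails) drop out immediately, which is where most of the token savings would come from. The obvious caveat is that tasks that separate quants of one model may not separate quants of another.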
Of course there is always the DIY approach, but I am wondering whether people have already tackled this problem somehow. I'd even settle for an automatic way to say that q5 is roughly 95.56% of q8, or something along those lines.
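For the "q5 is roughly X% of q8" kind of number, a rough sketch could be a plain score ratio on the same task set, with a bootstrap interval to show how much that ratio can be trusted when only a few tasks fit in the budget. The function `relative_score` and the scores below are hypothetical and only illustrate the arithmetic. Nothing here is measured data.

```python
import random


def relative_score(candidate, baseline, n_boot=2000, seed=0):
    """Return (ratio, low, high) for candidate vs. baseline.

    candidate, baseline: per-task scores on the SAME tasks, in the same order
    (for example 1 for pass, 0 for fail).
    low/high are a 95% percentile-bootstrap interval over tasks.
    """
    if len(candidate) != len(baseline) or sum(baseline) <= 0:
        raise ValueError("need paired scores and a non-zero baseline total")

    ratio = sum(candidate) / sum(baseline)

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(baseline)
    samples = []
    for _ in range(n_boot):
        # Resample tasks with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        base = sum(baseline[i] for i in idx)
        if base > 0:
            samples.append(sum(candidate[i] for i in idx) / base)
    if not samples:
        raise ValueError("bootstrap produced no usable resamples")

    samples.sort()
    low = samples[int(0.025 * len(samples))]
    high = samples[min(len(samples) - 1, int(0.975 * len(samples)))]
    return ratio, low, high


# Hypothetical pass/fail results on 20 shared tasks.
q8_scores = [1] * 20
q5_scores = [1] * 18 + [0] * 2

ratio, low, high = relative_score(q5_scores, q8_scores)
print(f"q5 is about {ratio:.0%} of q8 (95% interval {low:.0%} to {high:.0%})")
```

With only a handful of tasks the interval gets wide, so a figure with two decimals like 95.56% would need far more samples than a limited token budget usually allows.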