Introducing cyankiwi AWQ 4-bit Quantization — 26.05 update
In standard AWQ, per-channel scales and quantization ranges are chosen in separate steps: scales first, then the quantization parameters. But the two are not independent: the rounding error introduced by one depends on the choice of the other, so optimizing them in sequence leaves quality on the table. The cyankiwi AWQ 26.05 update instead fits scales and quantization ranges jointly against a reconstruction objective.

We benchmarked the 26.05 update against every major 4-bit method on three Llama-3 models, measuring KL divergence against the BF16 baseline on GPQA Diamond responses (lower is better). Result: cyankiwi posts the lowest KLD on all three models.

[Chart: KL divergence vs. BF16 baseline for Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, and Llama-3.3-70B-Instruct]
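For readers wondering what "jointly fits scales and quantization ranges" can look like in practice, here is a minimal sketch: a joint grid search over a per-input-channel scale exponent and a clipping ratio, scored by reconstruction error on calibration activations. The function names, the symmetric int4 scheme, and the grids are illustrative assumptions, not the actual cyankiwi implementation.

```python
# Sketch of a joint scale + clip-range search for 4-bit weight quantization,
# assuming per-input-channel scaling, symmetric int4 rounding, and a
# reconstruction objective ||X W^T - X Q(W)^T||^2. Illustrative only.
import torch

def fake_quant(w, scale, clip):
    # Scale the weight per input channel, clip its per-row dynamic range,
    # round to int4 levels in [-7, 7], then undo scaling.
    w_s = w * scale
    max_val = w_s.abs().amax(dim=-1, keepdim=True) * clip
    step = (max_val / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w_s / step), -7, 7) * step
    return q / scale

def joint_search(w, x, scale_grid, clip_grid):
    # Pick the (scale exponent, clip ratio) pair that minimizes
    # reconstruction error jointly, rather than scales first and clip second.
    act_norm = x.abs().mean(dim=0)          # per-input-channel activation stat
    ref = x @ w.t()                          # full-precision reference output
    best, best_err = None, float("inf")
    for alpha in scale_grid:
        scale = act_norm.pow(alpha).clamp(min=1e-5)
        for clip in clip_grid:
            err = (x @ fake_quant(w, scale, clip).t() - ref).pow(2).mean().item()
            if err < best_err:
                best, best_err = (alpha, clip), err
    return best, best_err

# Example: coarse grid search for one linear layer.
w = torch.randn(512, 1024)   # (out_features, in_features)
x = torch.randn(256, 1024)   # calibration activations
(alpha, clip), err = joint_search(w, x,
                                  scale_grid=[0.0, 0.25, 0.5, 0.75, 1.0],
                                  clip_grid=[0.8, 0.9, 1.0])
```

The point of the nested loop is that every clip ratio is evaluated under every candidate scale, so the pair is selected against a single objective instead of two sequential ones; a real implementation would do this per layer (or per group) with cached calibration activations.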