r/LocalLLaMA · 1 min read

We added W8A8 activation quantization to MLX — prefill went from 2.84s to 2.52s on M5 Pro

Hey, I work on inference tooling at Mininglamp AI. We needed faster prefill for a 4B VLM running on Apple Silicon. The problem: MLX's quantization is weight-only, so activations stay FP16 the whole way through. So we wrote Cider, a small SDK that adds W8A8 activation quantization on top of MLX.

Numbers on an M5 Pro (64 GB, 307 GB/s memory bandwidth) with a 4516-token context:

| Quantization | Prefill | Decode |
|---|---|---|
| W8A16 (MLX) | 2.839s | 80.1 tok/s |
| W8A8 (Cider) | 2.519s | 79.5 tok/s |
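
For anyone who wants the ratios spelled out, here is the arithmetic on the numbers above as a quick Python sketch. Nothing is newly measured; the variable names are just illustrative, and the prefill tok/s figure is derived by dividing the 4516-token context by the prefill time.

```python
# Figures copied from the benchmark table in the post; nothing here is newly measured.
CONTEXT_TOKENS = 4516
prefill_s = {"W8A16 (MLX)": 2.839, "W8A8 (Cider)": 2.519}
decode_tps = {"W8A16 (MLX)": 80.1, "W8A8 (Cider)": 79.5}

base = prefill_s["W8A16 (MLX)"]
new = prefill_s["W8A8 (Cider)"]

prefill_speedup = base / new                  # ratio of wall-clock prefill times
time_saved_pct = (base - new) / base * 100    # share of prefill time removed
prefill_tps = {name: CONTEXT_TOKENS / secs for name, secs in prefill_s.items()}
decode_change_pct = (decode_tps["W8A8 (Cider)"] / decode_tps["W8A16 (MLX)"] - 1) * 100

print(f"prefill speedup: {prefill_speedup:.3f}x ({time_saved_pct:.1f}% less time)")
for name, tps in prefill_tps.items():
    print(f"{name}: {tps:.0f} prefill tok/s")
print(f"decode change: {decode_change_pct:+.2f}%")
```

That works out to roughly a 1.13x end-to-end prefill speedup (about 11% less time) with decode dropping by under 1%. The end-to-end gain is smaller than the 1.84x kernel-level figure, presumably because prefill also includes work that is not a quantized matmul.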

Under the hood it's custom Metal kernels we registered as MLX primitives. At M=4096 the per-channel path runs 1.84x faster than W8A16 on the same shape. Not just for our model btw — works with anything that runs through MLX.
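
To make the W8A8 idea concrete, here is a minimal pure-Python sketch of the math, not Cider's Metal kernel or its API. It assumes the common convention where "per-channel" means one symmetric INT8 scale per activation row (token) and one per weight output channel, and that M is the number of activation rows in the matmul; the post does not spell either out. All function names are hypothetical.

```python
import random

def quantize_rows(mat):
    """Symmetric INT8 quantization with one scale per row (assumed scheme)."""
    q_rows, scales = [], []
    for row in mat:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def w8a8_matmul(x, w):
    """Compute y = x @ w.T with INT8 operands and integer accumulation.

    x: M x K activations, quantized per row (token) at run time.
    w: N x K weights, quantized per row (output channel).
    """
    xq, xs = quantize_rows(x)
    wq, ws = quantize_rows(w)
    out = []
    for i, xrow in enumerate(xq):
        out_row = []
        for j, wrow in enumerate(wq):
            acc = sum(a * b for a, b in zip(xrow, wrow))  # exact integer dot product
            out_row.append(acc * xs[i] * ws[j])            # dequantize once per output
        out.append(out_row)
    return out

def fp_matmul(x, w):
    """Float reference for y = x @ w.T."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]

random.seed(0)
M, K, N = 4, 64, 8  # tiny shapes so the pure-Python loops stay fast
x = [[random.gauss(0, 1) for _ in range(K)] for _ in range(M)]
w = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(N)]

ref = fp_matmul(x, w)
got = w8a8_matmul(x, w)
max_err = max(abs(r - g) for rr, gr in zip(ref, got) for r, g in zip(rr, gr))
print(f"max abs error vs float reference: {max_err:.5f}")
```

The point of the scheme is that the inner loop only touches integers, which is what lets hardware INT8 matmul units do the heavy lifting; the two float scales are applied once per output element.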

One catch: INT8 TensorOps only compile on M5 and above. pip install on M4 still works, just falls back to the regular path.
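
A capability-gated fallback like that can be sketched in a few lines. This is purely illustrative and not how Cider actually detects hardware: the function names and the returned labels are made up, and it assumes the chip name arrives as a string such as "Apple M5 Pro" (on macOS, `sysctl -n machdep.cpu.brand_string` reports something of that form).

```python
import re

MIN_INT8_GENERATION = 5  # from the post: INT8 TensorOps compile on M5 and above

def apple_chip_generation(brand):
    """Return N for strings like 'Apple M5 Pro', or None if not recognized."""
    match = re.search(r"\bApple M(\d+)\b", brand)
    return int(match.group(1)) if match else None

def pick_matmul_path(brand):
    """Hypothetical selector: INT8 path on supported chips, regular path otherwise."""
    gen = apple_chip_generation(brand)
    if gen is not None and gen >= MIN_INT8_GENERATION:
        return "int8_tensorops"
    return "regular"

for brand in ("Apple M4 Max", "Apple M5 Pro", "Intel(R) Core(TM) i9"):
    print(brand, "->", pick_matmul_path(brand))
```

Defaulting to the regular path for anything unrecognized keeps the install usable everywhere, which matches the behavior described for M4.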

Repo: https://github.com/Mininglamp-AI/cider

Edit: adding accuracy numbers since it came up. WikiText-2 perplexity (lower is better) on Qwen3-8B: FP16 9.73, W8A16 9.71, W8A8 per-channel 9.76. On Llama3-8B: FP16 6.14, W8A16 6.15, W8A8 per-channel 6.27. Per-group quantization with gs=64 (group size 64) stays closer to FP16 if precision matters more than speed for your use case.
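
As a rough illustration of why smaller groups can help precision, here is a toy Python sketch comparing one INT8 scale for a whole row against one scale per group of 64 values. The data is synthetic with a single planted outlier, the helper names are hypothetical, and the post does not say whether gs=64 applies to weights, activations, or both, so treat this as the general idea rather than Cider's scheme.

```python
import random

def quant_dequant(values):
    """Symmetric INT8 round trip for one block sharing a single scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) * scale for v in values]

def per_group_round_trip(row, group_size):
    """Quantize each contiguous group of `group_size` values with its own scale."""
    out = []
    for start in range(0, len(row), group_size):
        out.extend(quant_dequant(row[start:start + group_size]))
    return out

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

random.seed(0)
row = [random.gauss(0, 1) for _ in range(256)]
row[10] = 40.0  # one outlier stretches the scale of whatever block contains it

err_whole_row = mean_abs_error(row, quant_dequant(row))
err_gs64 = mean_abs_error(row, per_group_round_trip(row, 64))
print(f"one scale per row      : {err_whole_row:.4f}")
print(f"one scale per 64 values: {err_gs64:.4f}")
```

With a single scale, the outlier coarsens the grid for all 256 values; with groups of 64, only the group containing it pays that cost. The extra scales and bookkeeping are the likely reason the finer-grained path trades away some speed.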

submitted by /u/Enough-Astronaut9278