r/LocalLLaMA

I ported EXL3 to run well on Apple Silicon - PonyExl3

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Hi guys, Beam here. After revamping the chat interface in oMLX, I was playing with turboderp's exllamav3 on my RTX 4090 machine and wondered why I couldn't run it on my M5/M1 Max, so I built a port.

https://github.com/beamivalice/PonyExl3

For those who don't know EXL3: it's one of the best quantization formats available in terms of quality per GB of RAM, but it trades compute for that and relies heavily on CUDA to work. Now it runs on Metal. An M5 Max can pull a respectable ~600 tok/s prefill and ~17 tok/s generation from the Qwen3.6-27B model. With DFlash/MTP that went up to ~38 tok/s with greedy decoding and around 20-25 tok/s at normal temperature settings.

For Qwen3.6-35B-A3B at 4.00bpw, prefill reaches as high as 2700 tok/s, and decode beat my RTX 4090's ~50 tok/s at 68.5 tok/s, or 80 tok/s with Eagle3 in greedy mode.
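For a rough sense of what those bpw (bits per weight) figures mean in memory, here is a back-of-the-envelope sketch. It is only an estimate under loud assumptions: the parameter counts (27e9 and 35e9) are read off the model names rather than measured, every weight is treated as stored at the average bpw, and KV cache, activations, and runtime overhead are ignored. The function name `weight_footprint_gb` is made up for illustration and is not part of PonyExl3 or exllamav3.

```python
def weight_footprint_gb(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes.

    n_params: nominal parameter count (assumed from the model name).
    bpw: average bits per weight of the quantized model.
    """
    total_bits = n_params * bpw
    total_bytes = total_bits / 8
    return total_bytes / 1e9


# Nominal sizes taken from the model names in the post (an assumption).
models = [
    ("Qwen3.6-35B-A3B @ 4.00bpw", 35e9, 4.00),
    ("Qwen3.6-27B @ 4.15bpw", 27e9, 4.15),
]

for name, n_params, bpw in models:
    print(f"{name}: ~{weight_footprint_gb(n_params, bpw):.1f} GB of weights")
```

Real files will differ somewhat, since some tensors may be kept at a different precision than the average, but it shows why a 4 bpw model of this size fits comfortably in unified memory on a Max-class chip.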

So how good is its quality/memory tradeoff? Take a look at this chart compiled by deepsweet, plus my own run (results are still in a txt file in the repo).

https://preview.redd.it/t3z3w078vd7h1.png?width=1200&format=png&auto=webp&s=e2127e9c95ea3a250c98ddcc81ec5dd5027a6370

https://preview.redd.it/avf5ja3avd7h1.png?width=1202&format=png&auto=webp&s=e60712b1f2ab80ac0851569a0ec70b34680babf1

Then I wired it all into my oMLX and asked it to generate a polar bear picnic. Boom: 27B-exl3-4.15bpw on oMLX with a perfect polar bear.

https://preview.redd.it/g3qruvbzvd7h1.png?width=2750&format=png&auto=webp&s=1fc19170960ef62839ceee503ff6b4df12ec10ef

Cheers!

submitted by /u/Beamsters