I ported EXL3 to run well on Apple Silicon - PonyExl3
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Hi guys, Beam here. After I revamped the chat interface in oMLX, I was playing with turboderp's exllamav3 on my RTX 4090 machine and wondered why I couldn't run it on my M5/M1 Max, so I built a port: https://github.com/beamivalice/PonyExl3

For those who don't know EXL3: it's one of the best quantization formats available in terms of quality per GB of RAM, but it trades compute for that, and it relies heavily on CUDA to work.

Now it runs on Metal. An M5 Max pulls a respectable ~600 tok/s prefill and ~17 tok/s generation from the Qwen3.6-27B model. With DFlash/MTP, generation goes up to ~38 tok/s with greedy decoding and around 20-25 tok/s at normal temperature settings. For Qwen3.6-35B-A3B at 4.00bpw, prefill reaches as high as 2700 tok/s and decode hits 68.5 tok/s, surpassing my RTX 4090's ~50 tok/s, and 80 tok/s with Eagle3 in greedy mode.

So how good is its quality/memory trade-off? Take a look at this chart compiled by deepsweet; I also ran my own measurements (the results are still in a txt file in the repo).

Then I wired it all into oMLX and asked it to generate a polar bear picnic. Boom: 27B-exl3-4.15bpw on oMLX with a perfect polar bear. Cheers!
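For a rough sense of what the bpw figures mean for memory, here is a minimal back-of-the-envelope sketch. It is not taken from the PonyExl3 repo: the function name is hypothetical, the parameter counts are the nominal ones from the model names (27B and 35B), and it counts weights only, ignoring the KV cache, activations, and any layers stored at a different precision.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    total_bytes = n_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / 2**30                    # bytes -> GiB


# Nominal parameter counts read off the model names (assumption, not measured).
models = {
    "Qwen3.6-27B @ 4.15 bpw": (27e9, 4.15),
    "Qwen3.6-35B-A3B @ 4.00 bpw": (35e9, 4.00),
}

for name, (n_params, bpw) in models.items():
    gib = weight_footprint_gib(n_params, bpw)
    print(f"{name}: ~{gib:.1f} GiB of weights")
```

Actual memory use will be higher once context length and runtime buffers are added, so treat these numbers as a lower bound.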