r/LocalLLaMA · 2 min read

Gemma4 26b MoE running in MLX with turboquant (and custom kernel)


TL;DR: I spent a few crazy evenings this past week seeing if I could get Gemma4 running with proper turboquant and rotating-KV-cache support. The answer was yes, and I can now run Gemma4 26b on my MacBook Air M5 at 128k context with 4 concurrent batches 😄

At 8k context with mmap disabled, it beats llama.cpp on prompt processing speed, generation speed, and runtime memory:

backend      model                           bpw    pp tok/s   gen tok/s   runtime mem
llama.cpp    IQ4_XS + q4_0 KV + flash-attn   4.25   260.6      14.66       16.0 GB
MLX (ours)   nvfp4 + polar2                  4.5    348.4      17.15       15.22 GB
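If you want to reproduce the llama.cpp side, a llama-bench invocation matching the table's settings would look roughly like this (the GGUF filename is a placeholder, and flag spellings vary a bit across llama.cpp versions, so treat it as illustrative):

llama-bench -m gemma4-26b-IQ4_XS.gguf -fa 1 -ctk q4_0 -ctv q4_0 -mmp 0 -p 8192 -n 128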

It took a lot of hand-tuning to get to this speed, including a custom kernel for the SWA layers to get the actual runtime 2-bit memory savings that enable higher batch sizes while staying close to full fp16 prompt processing speed.
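For intuition, here's the core of the rotating-cache trick in plain numpy. This is a minimal sketch, not the repo's actual Metal kernel (which additionally packs the SWA cache down to 2-bit); the names, shapes, and window size are all illustrative:

import numpy as np

class RotatingKVCache:
    # Fixed-size ring buffer for one sliding-window-attention layer.
    def __init__(self, window, n_kv_heads, head_dim):
        self.window = window
        self.offset = 0  # total tokens written so far
        self.k = np.zeros((n_kv_heads, window, head_dim), dtype=np.float16)
        self.v = np.zeros_like(self.k)

    def update(self, k_new, v_new):
        # k_new, v_new: (n_kv_heads, head_dim) for one decoded token.
        slot = self.offset % self.window  # wrap around: overwrite the oldest entry
        self.k[:, slot] = k_new
        self.v[:, slot] = v_new
        self.offset += 1
        n = min(self.offset, self.window)
        # Buffer order doesn't matter for attention, as long as RoPE was
        # applied to k_new at its absolute position before caching.
        return self.k[:, :n], self.v[:, :n]

cache = RotatingKVCache(window=1024, n_kv_heads=8, head_dim=128)

The point is that per-sequence memory stays proportional to the window rather than the full 128k context, which is where the savings that enable higher batch sizes come from.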

Prompt processing scales reasonably well with batch size, but the bigger gains are in text generation. Here are the numbers for a 512-token prompt on a 32 GB M5:

batch   pp tok/s   gen tok/s
1       353        16.0
4       429        24.9
8       451        32.4
16      451        44.2
32      450        48.0
64      448        54.6
128     440        54.0
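If you want to sanity-check the batch scaling yourself once the server is up (setup below), here's a quick client-side sketch. It assumes mlx_lm.server's OpenAI-style chat endpoint on the default port and a usage block in the response; adjust the URL and fields for your setup:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"  # default mlx_lm.server port (assumption)

def one_request(_):
    # Fire one generation request and return how many tokens came back.
    payload = {
        "messages": [{"role": "user", "content": "Write a limerick about KV caches."}],
        "max_tokens": 128,
    }
    r = requests.post(URL, json=payload, timeout=300)
    # Assumes an OpenAI-style usage block in the response.
    return r.json()["usage"]["completion_tokens"]

for batch in (1, 4, 8, 16):
    start = time.time()
    with ThreadPoolExecutor(max_workers=batch) as pool:
        total_tokens = sum(pool.map(one_request, range(batch)))
    elapsed = time.time() - start
    print(f"B={batch:<3d} aggregate gen tok/s = {total_tokens / elapsed:.1f}")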

If you want to download and serve it yourself, open a terminal in the directory you want to clone the repo into and run:

git clone https://github.com/lovelacemadeline/gemma4-turboquant-mlx 

Then, if you have uv installed (pip3 also works, but I prefer uv), run:

cd gemma4-turboquant-mlx
uv tool install --from . --reinstall gemma4-turboquant-mlx

Then once it's installed, you can spin up the backend with:

mlx_lm.server --model mlx-community/gemma-4-26b-a4b-it-nvfp4 

And it should work 😄
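To sanity-check it from another terminal (mlx_lm.server exposes an OpenAI-style chat completions endpoint; the port here is the usual default, so adjust if yours differs):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hi in five words."}], "max_tokens": 64}'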

(Note: if you're running on a Mac with 16 GB of RAM, you'll need the wired-memory hack to get most quants of the Gemma MoE model running; I've included instructions for that in the repo.)
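For reference, on recent macOS that hack generally boils down to raising the GPU wired-memory limit via sysctl; the value below is illustrative (leave headroom for the OS), and the repo's instructions are authoritative:

sudo sysctl iogpu.wired_limit_mb=14000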

submitted by /u/maddie-lovelace

