r/LocalLLaMA · 2 min read

Gemma4 26b MoE running in MLX with turboquant (and custom kernel)


TL;DR: I spent a few crazy evenings this past week seeing if I could get Gemma4 running with proper turboquant and rotating-KV-cache support. The answer was yes, and I can now run Gemma4 26b on my MacBook Air M5 at 128k context with 4 concurrent batches 😄

At 8k context with mmap disabled, it beats llama.cpp on prompt processing speed, generation speed, and runtime memory:

backend      model                           bpw    pp tok/s   gen tok/s   runtime mem
llama.cpp    IQ4_XS + q4_0 KV + flash-attn   4.25   260.6      14.66       16.0 GB
MLX (ours)   nvfp4 + polar2                  4.5    348.4      17.15       15.22 GB
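If you want to reproduce the llama.cpp side, a llama-bench invocation matching the table's settings would look roughly like this (the GGUF filename is a placeholder, and flag spellings vary a bit across llama.cpp versions, so treat it as illustrative):

llama-bench -m gemma4-26b-IQ4_XS.gguf -fa 1 -ctk q4_0 -ctv q4_0 -mmp 0 -p 8192 -n 128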

It took a lot of hand-tuning to get to this speed, including a custom kernel for the SWA layers to get the actual runtime 2-bit memory savings that enable higher batch sizes while staying close to full fp16 prompt processing speed.
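For intuition, here's the core of the rotating-cache trick in plain numpy. This is a minimal sketch, not the repo's actual Metal kernel (which additionally packs the SWA cache down to 2-bit); the names, shapes, and window size are all illustrative:

import numpy as np

class RotatingKVCache:
    # Fixed-size ring buffer for one sliding-window-attention layer.
    def __init__(self, window, n_kv_heads, head_dim):
        self.window = window
        self.offset = 0  # total tokens written so far
        self.k = np.zeros((n_kv_heads, window, head_dim), dtype=np.float16)
        self.v = np.zeros_like(self.k)

    def update(self, k_new, v_new):
        # k_new, v_new: (n_kv_heads, head_dim) for one decoded token.
        slot = self.offset % self.window  # wrap around: overwrite the oldest entry
        self.k[:, slot] = k_new
        self.v[:, slot] = v_new
        self.offset += 1
        n = min(self.offset, self.window)
        # Buffer order doesn't matter for attention, as long as RoPE was
        # applied to k_new at its absolute position before caching.
        return self.k[:, :n], self.v[:, :n]

cache = RotatingKVCache(window=1024, n_kv_heads=8, head_dim=128)

The point is that per-sequence memory stays proportional to the window rather than the full 128k context, which is where the savings that enable higher batch sizes come from.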

Prompt processing scales reasonably well with batch size, but the bigger gains are in text generation. Here are the numbers for a 512-token prompt on a 32 GB M5:

batch   pp tok/s   gen tok/s
1       353        16.0
4       429        24.9
8       451        32.4
16      451        44.2
32      450        48.0
64      448        54.6
128     440        54.0
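If you want to sanity-check the batch scaling yourself once the server is up (setup below), here's a quick client-side sketch. It assumes mlx_lm.server's OpenAI-style chat endpoint on the default port and a usage block in the response; adjust the URL and fields for your setup:

import time
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"  # default mlx_lm.server port (assumption)

def one_request(_):
    # Fire one generation request and return how many tokens came back.
    payload = {
        "messages": [{"role": "user", "content": "Write a limerick about KV caches."}],
        "max_tokens": 128,
    }
    r = requests.post(URL, json=payload, timeout=300)
    # Assumes an OpenAI-style usage block in the response.
    return r.json()["usage"]["completion_tokens"]

for batch in (1, 4, 8, 16):
    start = time.time()
    with ThreadPoolExecutor(max_workers=batch) as pool:
        total_tokens = sum(pool.map(one_request, range(batch)))
    elapsed = time.time() - start
    print(f"B={batch:<3d} aggregate gen tok/s = {total_tokens / elapsed:.1f}")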

If you want to download and serve it yourself, open a terminal in the directory you want to clone the repo into and run:

git clone https://github.com/lovelacemadeline/gemma4-turboquant-mlx 

Then, if you have uv installed (pip3 also works, but I prefer uv), run:

cd gemma4-turboquant-mlx
uv tool install --from . --reinstall gemma4-turboquant-mlx

Then once it's installed, you can spin up the backend with:

mlx_lm.server --model mlx-community/gemma-4-26b-a4b-it-nvfp4 

And it should work 😄
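To sanity-check it from another terminal (mlx_lm.server exposes an OpenAI-style chat completions endpoint; the port here is the usual default, so adjust if yours differs):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hi in five words."}], "max_tokens": 64}'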

(Note: if you're running on a Mac with 16 GB of RAM, you'll need the wired-memory hack to get most quants of the Gemma MoE model running; I've included instructions for that in the repo.)
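For reference, on recent macOS that hack generally boils down to raising the GPU wired-memory limit via sysctl; the value below is illustrative (leave headroom for the OS), and the repo's instructions are authoritative:

sudo sysctl iogpu.wired_limit_mb=14000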

submitted by /u/maddie-lovelace

