r/LocalLLaMA · June 18, 2026 · 1 min read

DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts...


Figured I'd post up a bit of info for anyone else who was thinking about messing with this model on a 3090/4090.

Obviously I can't use the NVFP4 build on these cards, but I got it up and running in vLLM using diffusiongemma-26B-A4B-it-AWQ-INT4. I had to run it in the custom vLLM Docker image they provide for the purpose, then load a Gemma 4 tool/reasoning parser. Once it was all set up, it pushed 475t/s on the first prompt, and it seems to run between 290t/s and 700t/s depending on output length and context (long outputs come out very fast). It's pretty heavy on VRAM though, so you're not getting long context (I tested at 8k and could have gone higher, but not THAT much higher).
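For anyone wanting to compare numbers, here is a rough sketch of one way to turn a streamed response into t/s figures. The post doesn't say how its numbers were measured, so this is an assumption: `stream_stats` and `StreamStats` are made-up names, and the input is just a list of (arrival time, tokens in chunk) pairs that you would log yourself from whatever client you use.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class StreamStats:
    ttft_s: float          # seconds from request start to the first chunk
    total_tokens: int      # tokens summed over all chunks
    end_to_end_tps: float  # total tokens / total wall time (includes the wait for chunk 1)
    decode_tps: float      # tokens after the first chunk / time after the first chunk


def stream_stats(start_s: float, chunks: Iterable[Tuple[float, int]]) -> StreamStats:
    """Summarize a stream of (arrival_time_s, n_tokens) chunk events.

    Hypothetical helper: the timestamps and token counts are assumed to be
    logged by the caller; nothing here talks to a server.
    """
    events = sorted(chunks)
    if not events:
        raise ValueError("no chunks received")

    first_t, first_n = events[0]
    last_t = events[-1][0]
    total = sum(n for _, n in events)

    wall = last_t - start_s          # whole request, including time to first chunk
    decode_window = last_t - first_t  # only the time after the first chunk landed

    end_to_end = total / wall if wall > 0 else float("inf")
    decode = (total - first_n) / decode_window if decode_window > 0 else 0.0
    return StreamStats(first_t - start_s, total, end_to_end, decode)


if __name__ == "__main__":
    # Made-up timings: three 100-token chunks arriving every half second.
    print(stream_stats(0.0, [(0.5, 100), (1.0, 100), (1.5, 100)]))
```

Keeping end-to-end and decode-only rates separate matters for a model that hands back big chunks at once, since a short reply can be dominated by the wait for the first chunk.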

Downsides? It's single-user only (it slows down if you try to batch it), clearly worse at responses (it makes mistakes the regular 26B-A4B doesn't), and it can't find a needle in a haystack to save its life (its grip on context fades quickly). Time to first token is also a hair slower on short prompts than a regular LLM (it's diffusing everything and giving you the chunks all at once, so it takes a bit longer to get that first chunk).
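The first-chunk delay can be pictured with some back-of-the-envelope arithmetic. This is a toy model, not a description of how the model or vLLM actually schedules work: it assumes a regular LLM shows output after prefill plus one decode step, while a diffusion model shows nothing until it has run several denoising steps on a whole chunk. The function name and every number below are invented for illustration.

```python
def first_output_latency_s(prefill_s: float, step_s: float, steps_before_first_output: int) -> float:
    """Toy estimate of time until the user sees anything.

    Assumption: latency = prompt processing time + (time per step * number of
    steps that must finish before any text is emitted).
    """
    if steps_before_first_output < 1:
        raise ValueError("at least one step is needed before any output")
    return prefill_s + step_s * steps_before_first_output


if __name__ == "__main__":
    # Illustrative values only, not measurements from the post.
    autoregressive = first_output_latency_s(prefill_s=0.05, step_s=0.01, steps_before_first_output=1)
    diffusion = first_output_latency_s(prefill_s=0.05, step_s=0.02, steps_before_first_output=8)
    print(f"autoregressive first token: {autoregressive:.2f}s")
    print(f"diffusion first chunk:      {diffusion:.2f}s")
```

Under those assumptions the diffusion model shows its first text later, but that first chunk carries many tokens at once, which fits with long outputs coming out very fast.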

Is it worth bothering with? I don't think so. The regular 26B-A4B running through llama.cpp still manages over 300t/s when batched, and it's significantly more accurate.

submitted by /u/teachersecret