DiffusionGemma 26b on a 4090 at up to 475t/s... and some thoughts...
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Figured I'd post a bit of info for anyone else thinking about messing with this model on a 3090/4090.
Obviously I can't use the NVFP4 version, but I got it up and running in vLLM using diffusiongemma-26B-A4B-it-AWQ-INT4. I had to run it in a custom vLLM Docker image they provide for the purpose, then load a Gemma 4 tool/reasoning parser. Once that was all done, it pushed 475t/s on the first prompt, and it seems to run between 290t/s and 700t/s depending on output length and context (long outputs come out very fast). It's pretty heavy though, so you're not getting long context (I tested at 8k and could have gone higher, but not THAT much higher).
Downsides? It's single-user only (it slows down if you try to batch it), its responses are clearly worse (it makes mistakes the regular 26B-A4B doesn't), and it can't find a needle in a haystack to save its life (context fades quickly). Time to first token is also a hair slower on short prompts than with a regular LLM (it's diffusing everything and handing you the chunks all at once, so it takes a bit longer to get that first chunk).
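If you want to compare time to first token and t/s on your own setup, here's a minimal sketch of how the two numbers can be pulled out of a streamed response. It assumes you've already logged each chunk as an (arrival time, token count) pair; the names `StreamStats` and `summarize_stream` are made up for this example, and the numbers in the demo are invented to show the shape of the data, not measurements from my runs.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class StreamStats:
    ttft_s: float        # seconds from request start to the first chunk
    total_tokens: int    # tokens summed over all chunks
    tokens_per_s: float  # total tokens divided by total wall time


def summarize_stream(start_s: float,
                     chunks: Iterable[Tuple[float, int]]) -> StreamStats:
    """Summarize a streamed generation.

    chunks: (arrival_time_s, tokens_in_chunk) pairs in arrival order.
    A token-by-token model yields many 1-token chunks; a block-wise
    (diffusion-style) model yields a few large chunks.
    """
    events = list(chunks)
    if not events:
        raise ValueError("no chunks received")
    first_arrival = events[0][0]
    last_arrival = events[-1][0]
    elapsed = last_arrival - start_s
    if elapsed <= 0:
        raise ValueError("last chunk must arrive after the request start")
    total_tokens = sum(n for _, n in events)
    return StreamStats(
        ttft_s=first_arrival - start_s,
        total_tokens=total_tokens,
        tokens_per_s=total_tokens / elapsed,
    )


if __name__ == "__main__":
    # Invented timings, only to illustrate the two streaming patterns.
    blockwise = summarize_stream(0.0, [(0.5, 64), (1.0, 64)])
    tokenwise = summarize_stream(0.0, [(0.25, 1), (0.5, 1), (0.75, 1), (1.0, 1)])
    print(blockwise)
    print(tokenwise)
```

Measuring over total wall time (including the wait for the first chunk) is a deliberate choice here: it penalizes a slow first chunk on short outputs, which is where the difference from a regular LLM shows up.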
Is it worth bothering with? I don't think so. The regular 26B-A4B running through llama.cpp still hits over 300t/s when batched, and it's significantly more accurate.