r/LocalLLaMA · 1 min read

Diffusion Gemma is 4x faster, but makes 6x more mistakes!

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

We benchmarked the new Gemma diffusion model against its autoregressive twin on a single H100 (FP8). We gave each the same three tasks: write a Steve Jobs biography, the history of Tetris, and the story of BeOS, each topic less popular than the one before. Then we fact-checked every claim in every answer.

Gemma4 got 45 facts right and 5 wrong. DiffusionGemma got 33 right and 28 wrong. DiffusionGemma did worse on the less popular topics: 4 mistakes on Jobs, 12 on Tetris, 12 on BeOS. It named Clara Clley as Steve Jobs' mother, invented a colleague for Pajitnov named Geri Gulovik, and priced the BeBox at $9,999. The real one cost $1,600.

Results:
Gemma4 26B A4B: 218 tok/s · 15.1s total · 45 correct facts · 5 mistakes
DiffusionGemma 26B A4B: 763 tok/s · 3.7s total · 33 correct facts · 28 mistakes
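
For anyone who wants to check the headline ratios, here is a minimal sketch that derives them from the two result lines above. The numbers are copied from the post; the dictionary keys, variable names and the `error_rate` helper are made up for illustration and are not part of any benchmark tooling.

```python
# Figures copied from the post; all names here are illustrative only.
results = {
    "Gemma4": {"tok_per_s": 218, "seconds": 15.1, "right": 45, "wrong": 5},
    "DiffusionGemma": {"tok_per_s": 763, "seconds": 3.7, "right": 33, "wrong": 28},
}


def error_rate(r):
    """Share of fact-checked claims that turned out wrong."""
    return r["wrong"] / (r["right"] + r["wrong"])


ar = results["Gemma4"]
diff = results["DiffusionGemma"]

throughput_ratio = diff["tok_per_s"] / ar["tok_per_s"]  # tokens per second
wall_clock_ratio = ar["seconds"] / diff["seconds"]      # total generation time
mistake_ratio = diff["wrong"] / ar["wrong"]             # raw mistake counts

print(f"throughput: {throughput_ratio:.1f}x")
print(f"wall clock: {wall_clock_ratio:.1f}x")
print(f"mistakes:   {mistake_ratio:.1f}x")
for name, r in results.items():
    print(f"{name}: {error_rate(r):.1%} of checked claims wrong")
```

So the "4x" is the wall-clock ratio (about 4.1x; raw throughput is 3.5x), and the "6x" is 5.6x in raw mistake counts. As a share of checked claims, that is 10% wrong for Gemma4 versus about 46% for DiffusionGemma.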

The reason is simple. DiffusionGemma throws 256 tokens on the screen at once and polishes them pass after pass until the text sounds smooth. Smooth is all it cares about: a fake name, date or number sounds just as smooth as a real one, so it stays. Regular Gemma4, meanwhile, writes one token at a time and checks each new token against everything before it. Google says so itself in the launch post: quality is lower, so use regular Gemma 4 when facts matter.
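
To make the speed difference concrete, here is a toy sketch of the two control flows. It is not DiffusionGemma's actual algorithm: it assumes one common family of diffusion-style decoders (fill a block, then commit the most confident positions on each pass), and `toy_predict`, the 8-pass budget and the fake confidences are all hypothetical stand-ins. Only the 256-token block size comes from the post.

```python
import random

MASK = None  # marks a position that has not been committed yet


def toy_predict(tokens, position, rng):
    """Hypothetical stand-in for a model call: returns (token, confidence).

    A real model would condition on `tokens`; this toy ignores them.
    """
    return f"tok{position}", rng.random()


def autoregressive_decode(length, rng):
    """One token per sequential step, each proposed given the full prefix."""
    tokens, steps = [], 0
    for position in range(length):
        token, _ = toy_predict(tokens, position, rng)
        tokens.append(token)
        steps += 1
    return tokens, steps


def block_refine_decode(length, passes, rng):
    """Propose every open position in parallel, commit the most confident ones."""
    tokens = [MASK] * length
    per_pass = -(-length // passes)  # ceiling division
    steps = 0
    while MASK in tokens:
        open_positions = [i for i, t in enumerate(tokens) if t is MASK]
        proposals = {i: toy_predict(tokens, i, rng) for i in open_positions}
        ranked = sorted(open_positions, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:per_pass]:
            tokens[i] = proposals[i][0]
        steps += 1
    return tokens, steps


_, ar_steps = autoregressive_decode(256, random.Random(0))
_, block_steps = block_refine_decode(256, 8, random.Random(0))
print(f"autoregressive: {ar_steps} sequential steps")
print(f"block refine:   {block_steps} passes")
```

With these made-up settings the autoregressive loop takes 256 sequential steps and the block loop takes 8 passes. Notice that the only thing the block loop ranks on is confidence, which is the point above: nothing in that loop knows whether a committed name or price is true.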

Open-source harness for local AI models: Atomic.Chat (I'm the founder; we support GGUF models, MLX on Apple Silicon, MTP, and Google TurboQuant for long context windows, and we're working on diffusion support via llama.cpp)

submitted by /u/gladkos