r/LocalLLaMA · · 1 min read

DiffusionGemma under real workloads feels very different from benchmark demos

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

okay after testing DiffusionGemma a bit more internally we genuinely can’t tell if this is the start of something big or if everyone’s just getting distracted by crazy TPS numbers again lol

but one thing that stood out REALLY fast for us was how different the H100 vs A100 behavior felt compared to normal transformer inference

on some runs the H100s scaled almost exactly how you’d want them to

the A100s were still good, but once concurrency started increasing the gap widened way more than we expected. not the usual “yeah H100 is faster” difference - this felt more dramatic

another thing we noticed was that the model looks absolutely insane on cleaner workloads and shorter generations, but once you start mixing in longer outputs, uneven request lengths, streaming, multiple users, different temperatures, etc., the behavior changes really fast

some workloads looked almost suspiciously fast honestly

then one messy real-world style batch would suddenly bring efficiency down harder than expected
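
for anyone who wants to poke at the clean vs messy difference themselves, here's a rough sketch of one way to quantify how "uneven" a batch is. to be clear, this is not our benchmark harness and not a diagnosis of what DiffusionGemma is doing internally. it assumes a simplified model where a whole batch keeps running until its longest request finishes, so shorter requests waste compute. the function names (`make_batch`, `batch_efficiency`) and the length ranges are made up for illustration.

```python
import random


def make_batch(n, min_len, max_len, seed=0):
    """Build a synthetic batch of n requests (hypothetical helper).

    Each entry is a target output length in tokens, drawn uniformly
    from [min_len, max_len]. The seed keeps runs reproducible.
    """
    rng = random.Random(seed)
    return [rng.randint(min_len, max_len) for _ in range(n)]


def batch_efficiency(lengths):
    """Fraction of computed tokens that are actually useful.

    ASSUMPTION: the batch runs until its longest request is done, so
    every request costs max(lengths) tokens of work. Real serving
    stacks may evict finished requests early and do better than this.
    """
    if not lengths:
        return 0.0
    return sum(lengths) / (len(lengths) * max(lengths))


# "clean" benchmark-style batch: short, nearly uniform generations
clean = make_batch(32, 120, 136)

# "messy" real-world-style batch: wildly uneven output lengths
messy = make_batch(32, 16, 2048)

print(f"clean batch efficiency: {batch_efficiency(clean):.2f}")
print(f"messy batch efficiency: {batch_efficiency(messy):.2f}")
```

under that assumption the clean batch can never drop below 120/136 (about 0.88), while a single long request in the messy batch drags everything else down with it. if your numbers on real hardware don't track this kind of metric at all, that's a hint the slowdown is coming from somewhere else.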

also GPU utilization patterns looked very different from what we’re used to seeing with normal decode-heavy serving

hard to explain properly yet but it didn’t feel like the classic token-by-token bottleneck situation at all

dropping some pics from the A100 test boxes as well

we’re still testing a lot more combinations + real traffic simulations right now and honestly the more we test the more questions we have

will share more numbers once we finish running more workloads across the stacks

curious if others here are seeing similar behavior or completely different results

submitted by /u/qubridInc