DiffusionGemma under real workloads feels very different from benchmark demos
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
okay, after testing DiffusionGemma a bit more internally, we genuinely can’t tell if this is the start of something big or if everyone’s just getting distracted by crazy TPS numbers again lol

but one thing that stood out REALLY fast for us was how different the H100 vs A100 behavior felt compared to normal transformer inference. on some runs the H100s scaled almost exactly how you’d want them to. the A100s were still good, but once concurrency started increasing the gap widened way more than we expected. not the usual “yeah H100 is faster” difference - this felt more dramatic.

another thing we noticed was that the model looks absolutely insane on cleaner workloads and shorter generations, but once you start mixing longer outputs, uneven request lengths, streaming, multiple users, different temperatures etc., the behavior changes really fast. some workloads looked almost suspiciously fast, honestly. then one messy real-world style batch would suddenly bring efficiency down harder than expected.

also, GPU utilization patterns looked very different from what we’re used to seeing with normal decode-heavy serving. hard to explain properly yet, but it didn’t feel like the classic token-by-token bottleneck situation at all.

dropping some pics from the A100 test boxes as well.

we’re still testing a lot more combinations + real traffic simulations right now, and honestly the more we test the more questions we have. will share more numbers once we finish running more workloads across the stacks.

curious if others here are seeing similar behavior or completely different results.
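for anyone wanting to compare notes on the "clean vs messy batch" point, here is a minimal sketch of one way to put a number on how uneven a batch is. it assumes a simple lockstep model where every request holds its batch slot until the longest request finishes. that model is an assumption for illustration only, not a description of how DiffusionGemma or any particular serving stack schedules work, and the function name and example lengths are hypothetical.

```python
def lockstep_efficiency(lengths):
    """Fraction of batch slots doing useful work, ASSUMING every request
    occupies its slot until the longest request in the batch finishes.
    This is an illustrative model, not a measured property of any server."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    longest = max(lengths)
    if longest <= 0:
        raise ValueError("lengths must contain a positive value")
    # Useful tokens divided by total slots reserved (batch size * longest).
    return sum(lengths) / (len(lengths) * longest)


# Hypothetical output lengths (in tokens) for two batches of four requests.
clean_batch = [128, 128, 128, 128]   # uniform, benchmark-style
messy_batch = [16, 64, 128, 1024]    # uneven, closer to mixed real traffic

print(f"clean: {lockstep_efficiency(clean_batch):.2f}")  # clean: 1.00
print(f"messy: {lockstep_efficiency(messy_batch):.2f}")  # messy: 0.30
```

logging a ratio like this next to throughput for each batch makes it easier to tell whether a slowdown lines up with length skew or with something else, such as concurrency or streaming.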