Fused MoE dispatch kernel in pure Triton: 89-131% of Megablocks, runs on AMD with zero code changes
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I've been working on MoE inference and wrote a fused dispatch kernel entirely in Triton, no CUDA. At inference batch sizes (up to 512 tokens) it reaches 89-131% of the throughput of Megablocks (Stanford's CUDA-optimized MoE lib), and the same kernel runs on AMD MI300X with no changes. Benchmarks are Mixtral-8x7B on A100.

The biggest win was fusing the gate and up projections so the SwiGLU intermediate never leaves registers, which cuts 35% of global memory traffic. Fewer kernel launches (5 vs 24+) helped but mattered less.

Honest limitations: it falls behind Megablocks at 2048+ tokens, and 64+ experts under heavy routing skew is still rough, so DeepSeek-V3-scale expert counts aren't there yet.

Code: https://github.com/bassrehab/triton-kernels
Writeup with benchmarks: https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/
Paper: https://arxiv.org/abs/2605.23911

Feedback welcome, especially on the AMD perf side, which is still unoptimized.
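For readers who want to see what "fusing the gate and up projections" means, here is a minimal NumPy sketch of the idea. It is not the kernel from the repo: the function names, shapes, and tile size are made up for illustration, and the per-tile accumulators only stand in for what a GPU kernel would hold in registers. The unfused version writes two full intermediates before combining them; the fused version loads each input tile once, feeds both projections, and applies SiLU(gate) * up before anything is written out.

```python
import numpy as np

def silu(z):
    # SiLU / swish activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_unfused(x, w_gate, w_up):
    # Two separate matmuls, each materializing a full (tokens, d_ff) array.
    gate = x @ w_gate
    up = x @ w_up
    return silu(gate) * up

def swiglu_fused(x, w_gate, w_up, tile=4):
    # Hypothetical tiled version: one pass over x feeds both projections,
    # and the activation is applied per output tile before the store.
    n_tokens, d_model = x.shape
    d_ff = w_gate.shape[1]
    out = np.empty((n_tokens, d_ff), dtype=x.dtype)
    for col in range(0, d_ff, tile):
        cols = slice(col, min(col + tile, d_ff))
        width = cols.stop - cols.start
        # Stand-ins for register accumulators in a real kernel.
        gate_acc = np.zeros((n_tokens, width), dtype=x.dtype)
        up_acc = np.zeros((n_tokens, width), dtype=x.dtype)
        for k in range(0, d_model, tile):
            ks = slice(k, min(k + tile, d_model))
            x_tile = x[:, ks]  # loaded once, used by both projections
            gate_acc += x_tile @ w_gate[ks, cols]
            up_acc += x_tile @ w_up[ks, cols]
        # Only the combined result is written; gate/up are never stored.
        out[:, cols] = silu(gate_acc) * up_acc
    return out

# Toy sizes, chosen so d_ff is not a multiple of the tile size.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
w_gate = rng.standard_normal((16, 10))
w_up = rng.standard_normal((16, 10))

reference = swiglu_unfused(x, w_gate, w_up)
fused = swiglu_fused(x, w_gate, w_up)
```

Both paths compute the same values; the difference a real kernel cares about is that the fused path never writes or re-reads the gate and up intermediates in global memory.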