r/LocalLLaMA · · 1 min read

Fused MoE dispatch kernel in pure Triton: 89-131% of Megablocks, runs on AMD with zero code changes

I've been working on MoE inference and wrote a fused dispatch kernel entirely in Triton, no CUDA.

At inference batch sizes (up to 512 tokens) it reaches 89-131% of Megablocks performance (Megablocks is Stanford's CUDA-optimized MoE lib), measured with Mixtral-8x7B on an A100. The same kernel runs on AMD MI300X with no code changes.

The biggest win was fusing the gate+up projections so the SwiGLU intermediate never leaves registers, cutting 35% of global memory traffic. Fewer kernel launches (5 vs 24+) helped but mattered less.
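To make the fusion idea concrete, here is a minimal CPU-side sketch in plain Python. It is not the Triton kernel from the repo: the function names (`swiglu_unfused`, `swiglu_fused`) and the toy weights are made up for illustration. It only shows why the gate and up results can be combined per output element without ever storing the two intermediate vectors.

```python
import math

def silu(x):
    # SiLU / swish activation: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    # w is a list of rows, each row has len(x) entries
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_unfused(x, w_gate, w_up):
    # Two separate projections. On a GPU each intermediate vector would
    # typically be written to global memory and read back by the next kernel.
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]

def swiglu_fused(x, w_gate, w_up):
    # One pass: both dot products for an output element are accumulated
    # side by side and combined immediately, so neither the gate vector
    # nor the up vector is ever materialized.
    out = []
    for g_row, u_row in zip(w_gate, w_up):
        g_acc = 0.0
        u_acc = 0.0
        for xi, gw, uw in zip(x, g_row, u_row):
            g_acc += gw * xi
            u_acc += uw * xi
        out.append(silu(g_acc) * u_acc)
    return out

# Toy inputs, hypothetical values chosen only for the demo
x = [0.5, -1.0, 2.0]
w_gate = [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]]
w_up = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]

unfused = swiglu_unfused(x, w_gate, w_up)
fused = swiglu_fused(x, w_gate, w_up)
```

Both paths give the same numbers. The only difference is where the intermediates live, which is the part that saves memory traffic on a GPU.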

Honest limitations: it falls behind Megablocks at 2048+ tokens, and 64+ experts under heavy routing skew is still rough, so DeepSeek-V3-scale expert counts aren't there yet.
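For anyone wondering what "routing skew" means in practice, here is a small plain-Python sketch, again not code from the repo. The helper names (`expert_load`, `exclusive_offsets`, `imbalance`) and the toy routing table are hypothetical. It counts how many token slots each expert receives under top-k routing, derives the per-expert start offsets a grouped dispatch would need, and reports max load over mean load as a rough skew measure.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    # assignments: one tuple of expert ids per token (top-k routing)
    counts = Counter(e for token_experts in assignments for e in token_experts)
    return [counts.get(e, 0) for e in range(num_experts)]

def exclusive_offsets(loads):
    # Start index of each expert's slice once token slots are grouped by expert
    offsets, running = [], 0
    for n in loads:
        offsets.append(running)
        running += n
    return offsets

def imbalance(loads):
    # Max load divided by mean load; 1.0 means perfectly balanced
    mean = sum(loads) / len(loads)
    return max(loads) / mean

# Toy example: 4 tokens, top-2 routing, 4 experts, expert 0 is "hot"
assignments = [(0, 1), (0, 2), (0, 1), (0, 3)]
loads = expert_load(assignments, num_experts=4)
offsets = exclusive_offsets(loads)
skew = imbalance(loads)
```

Here expert 0 gets every token, so its slice is twice the average size. With many more experts and the same kind of hot expert, most slices get tiny while one stays large, which is one plausible reason the 64+ expert case is harder to keep efficient.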

Code: https://github.com/bassrehab/triton-kernels

Writeup with benchmarks: https://subhadipmitra.com/blog/2026/fused-moe-dispatch-triton/

Paper: https://arxiv.org/abs/2605.23911

Feedback welcome, especially on the AMD perf side, which is still unoptimized.

submitted by /u/bassrehab