arXiv — Machine Learning · May 13, 2026

DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

Mirrored from arXiv — Machine Learning for archival readability. Support the source by reading on the original site.


arXiv:2605.11005v1 Announce Type: new Abstract: Mixture-of-experts (MoE) architectures enable trillion-parameter LLMs with sparsely activated experts. Expert parallelism (EP) is a widely adopted MoE training strategy, but it suffers from severe all-to-all communication bottlenecks, which are exacerbated by limited inter-node network bandwidth as growing model sizes require distributing experts across GPU nodes. Prior work has focused on overlapping these all-to-all communications with feed-forward network (FFN) and self-attention computations, which often leaves residual network-bound stalls due to the inherent imbalance between the computation-communication ratios of attention and FFN layers. We present DisagMoE, a disaggregated MoE training system that jointly optimizes model placement and scheduling for maximal efficiency. DisagMoE separates attention and FFN layers into disjoint GPU groups, introduces a multi-stage pipeline with uni-directional, many-to-many communications, and employs a computation-communication roofline model to balance GPU and network bandwidth allocation among the attention and FFN groups. DisagMoE is implemented on Megatron-LM, and evaluation shows that DisagMoE improves training efficiency across multiple MoE models with up to 1.8x speedup on 16-node 8xH800 clusters.
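The abstract does not spell out the roofline model, so the following is only a toy sketch of the general idea, not DisagMoE's actual formulation. It assumes each group's stage time is the larger of its compute time and its communication time, and it searches for the GPU split between a hypothetical attention group and FFN group that minimizes the slower stage. The function names (`stage_time`, `best_split`) and all parameters are illustrative assumptions.

```python
def stage_time(flops, bytes_moved, n_gpus, gpu_flops, link_bw):
    """Roofline-style estimate (assumed): a stage is bound by whichever
    is slower, its computation or its communication."""
    compute = flops / (n_gpus * gpu_flops)
    comm = bytes_moved / (n_gpus * link_bw)
    return max(compute, comm)


def best_split(total_gpus, attn_flops, ffn_flops,
               attn_bytes, ffn_bytes, gpu_flops, link_bw):
    """Try every split of GPUs between the two groups and keep the one
    with the smallest pipeline bottleneck (the slower of the two stages).
    Returns (bottleneck_time, n_attn_gpus, n_ffn_gpus)."""
    best = None
    for n_attn in range(1, total_gpus):
        n_ffn = total_gpus - n_attn
        t_attn = stage_time(attn_flops, attn_bytes, n_attn, gpu_flops, link_bw)
        t_ffn = stage_time(ffn_flops, ffn_bytes, n_ffn, gpu_flops, link_bw)
        bottleneck = max(t_attn, t_ffn)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, n_attn, n_ffn)
    return best


if __name__ == "__main__":
    # Hypothetical workload: FFN needs 3x the compute of attention,
    # communication ignored, 8 GPUs with unit throughput.
    print(best_split(8, 1.0, 3.0, 0.0, 0.0, 1.0, 1.0))
```

In this made-up workload the search assigns GPUs in proportion to compute demand, so neither group idles waiting for the other. A real system would also have to account for overlap between computation and communication, which this sketch ignores.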
