Hugging Face Daily Papers · June 2, 2026 · 3 min read

DOT-MoE: Differentiable Optimal Transport for MoEfication

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2606.01666

Published on Jun 1, 2026 · Submitted by Udbhav Bamba on Jun 2, 2026

Authors: Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig, Deepak Gupta

Abstract

DOT-MoE formulates dense layer decomposition as a differentiable optimal transport problem, enabling efficient training of sparse MoE models with improved performance retention.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture-of-Experts (MoE) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute-intensive. Converting pre-trained dense models into sparse MoEs has emerged as an alternative; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, using differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we use Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
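The balanced-transport step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal pure-Python Sinkhorn-Knopp loop under assumed settings (8 neurons, 2 experts, random affinity scores, a temperature `tau`), and the names `sinkhorn` and `round_to_experts` are hypothetical. It alternates row and column normalization so that each neuron carries one unit of mass and each expert receives exactly its capacity, then greedily rounds the soft plan to a hard assignment that respects capacity. The greedy rounding is an assumption made for illustration; the paper instead learns the discrete assignment with straight-through estimators, which requires an autodiff framework and is omitted here.

```python
import math
import random


def sinkhorn(scores, capacity, tau=0.5, n_iters=200):
    """Soft neuron-to-expert plan: rows sum to ~1, columns sum to `capacity`."""
    # Exponentiate temperature-scaled scores; subtracting the row max avoids
    # overflow and is cancelled by the first row normalization.
    plan = [[math.exp((s - max(row)) / tau) for s in row] for row in scores]
    n_experts = len(scores[0])
    for _ in range(n_iters):
        # Row step: every neuron distributes exactly one unit of mass.
        for row in plan:
            total = sum(row)
            for j in range(n_experts):
                row[j] /= total
        # Column step: every expert receives exactly `capacity` units.
        for j in range(n_experts):
            total = sum(row[j] for row in plan)
            for row in plan:
                row[j] *= capacity / total
    return plan


def round_to_experts(plan, capacity):
    """Greedy rounding (an illustrative assumption, not the paper's method)."""
    n_experts = len(plan[0])
    # Visit (mass, neuron, expert) triples from the largest mass downwards.
    pairs = sorted(
        ((p, i, j) for i, row in enumerate(plan) for j, p in enumerate(row)),
        reverse=True,
    )
    assignment = [None] * len(plan)
    load = [0] * n_experts
    for _, i, j in pairs:
        # Assign a neuron once, and only to an expert with spare capacity.
        if assignment[i] is None and load[j] < capacity:
            assignment[i] = j
            load[j] += 1
    return assignment


# Hypothetical setup: 8 FFN neurons, 2 experts, random affinity scores.
rng = random.Random(0)
n_neurons, n_experts = 8, 2
capacity = n_neurons // n_experts
scores = [[rng.uniform(-1.0, 1.0) for _ in range(n_experts)]
          for _ in range(n_neurons)]

plan = sinkhorn(scores, capacity)
assignment = round_to_experts(plan, capacity)
print(assignment)  # one expert index per neuron, `capacity` neurons per expert
```

In a trainable variant, the scores would be learnable parameters. A straight-through estimator would then use the hard assignment in the forward pass while letting gradients flow through the soft plan.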


Community

udbhavbamba

Paper submitter about 7 hours ago

A differentiable optimal transport framework for dense-to-MoE model conversion; retaining 90% of dense performance at 50% active parameters.


Get this paper in your agent:

hf papers read 2606.01666

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
