Redesign Mixture-of-Experts Routers with Manifold Power Iteration
Authors: Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin
Published: 2026-06-10 (arXiv: 2606.12397)
Code: https://github.com/ericshwu/Router-with-Manifold-Power-Iteration
Abstract
Researchers propose a novel router redesign for Mixture-of-Experts models that aligns router rows with the principal singular directions of expert matrices using Manifold Power Iteration to improve model effectiveness.
The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row condenses its expert matrix into a representative vector, so that its dot product with a token reflects token-expert affinity. However, there exist no design principles that enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, in which a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters and confirm that this alignment yields more effective MoE models.
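The "Power-then-Retract" idea can be illustrated with a small NumPy sketch. This is not the authors' implementation (the linked repository is the reference). It assumes each expert is summarized by a single matrix W_e, that the power step takes the form r_e <- W_e^T W_e r_e, and that the retraction simply rescales each row to a fixed norm. The names `power_then_retract`, `top_k_route`, and `radius` are hypothetical, and the toy experts are built with a known singular-value gap so that convergence is easy to check.

```python
import numpy as np

def power_then_retract(router, experts, radius=1.0):
    """One assumed Power-then-Retract update for every router row.

    router:  (n_experts, d_model), one row per expert.
    experts: (n_experts, d_hidden, d_model), ASSUMED one matrix W_e per expert.
    """
    # Power step (assumed form): r_e <- W_e^T (W_e r_e). This amplifies the
    # component of r_e along the principal right singular direction of W_e.
    hidden = np.einsum("ehd,ed->eh", experts, router)
    powered = np.einsum("ehd,eh->ed", experts, hidden)
    # Retraction (assumed form): rescale each row back to norm `radius`.
    norms = np.linalg.norm(powered, axis=1, keepdims=True)
    return radius * powered / np.maximum(norms, 1e-12)

def top_k_route(tokens, router, k=2):
    """Standard dot-product routing: indices of the k highest-scoring experts."""
    logits = tokens @ router.T  # (n_tokens, n_experts)
    return np.argsort(-logits, axis=1)[:, :k]

rng = np.random.default_rng(0)
n_experts, d_model = 4, 8

# Toy experts W_e = diag(spectrum) @ Q_e^T with a known, well-separated spectrum.
spectrum = np.linspace(2.0, 0.5, d_model)
bases = np.stack([
    np.linalg.qr(rng.standard_normal((d_model, d_model)))[0]
    for _ in range(n_experts)
])
experts = spectrum[None, :, None] * np.transpose(bases, (0, 2, 1))

# Repeated updates with the experts held fixed, for illustration only.
router = rng.standard_normal((n_experts, d_model))
for _ in range(100):
    router = power_then_retract(router, experts)

# Compare each router row with the top right singular vector from an SVD.
_, _, vt = np.linalg.svd(experts, full_matrices=False)
alignment = np.abs(np.sum(router * vt[:, 0, :], axis=1))  # |cosine| per expert

tokens = rng.standard_normal((5, d_model))
chosen = top_k_route(tokens, router, k=2)
print(alignment)
print(chosen)
```

In this toy setting the experts stay fixed, so the loop is ordinary power iteration and `alignment` approaches 1 for every expert. During training the expert weights change at every step, so how many power steps are applied per forward pass, and what that costs, is a design detail to take from the paper and repository rather than from this sketch.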
Community
We propose a redesign of the MoE router that uses Power Iteration during the forward pass to couple router weights and expert parameters within the singular space of the parameters. We contend that this imposes an explicit constraint that forces router weights to better reflect the parametric characteristics of the expert weights, resulting in better expert routing. Our initial results and extensive analysis validate the effectiveness of this design. We hope our work inspires researchers to rethink MoE routers and leads to more valuable insights for future router designs.
This is a neat approach to MoE routing. I like the idea of moving away from arbitrary router weights and instead using the principal singular direction of the experts to guide the selection process. It feels like a much more grounded way to define token-expert affinity than how most models currently handle it.
Since this uses a Power-then-Retract paradigm, how much computational overhead does it add to the training loop compared to standard routing?
I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/b091d9ea-bfd5-4ea9-bced-18546d1f87e4
Cite arxiv.org/abs/2606.12397 in a model, dataset, or Space README.md to link it from this page.