Hugging Face Daily Papers · 3 min read

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Code: https://github.com/krafton-ai/moe-to-dense · Organization: KRAFTON
arxiv:2605.28207

Published on May 27 · Submitted by Junhyuck Kim on Jun 9
Authors: Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

Abstract

AI-generated summary: A systematic framework converts mixture-of-experts models into dense architectures through expert scoring, selection, grouping, and knowledge distillation, achieving superior performance and efficiency compared to traditional pruning methods.

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
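The "concatenated into a dense FFN" step can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes gated (SwiGLU-style) experts, uses hypothetical per-expert scores as a stand-in for the scoring methods the paper compares (the diversity-aware score is not specified in the abstract), and skips grouping and magnitude scaling. It only shows that stacking the kept experts along the hidden dimension gives one dense FFN whose output equals the unweighted sum of those experts.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gated_ffn(x, w_gate, w_up, w_down):
    """Gated FFN: (silu(x @ W_gate) * (x @ W_up)) @ W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def concat_experts(experts, keep):
    """Stack the kept experts along the hidden (intermediate) dimension."""
    w_gate = np.concatenate([experts[i][0] for i in keep], axis=1)
    w_up = np.concatenate([experts[i][1] for i in keep], axis=1)
    w_down = np.concatenate([experts[i][2] for i in keep], axis=0)
    return w_gate, w_up, w_down

rng = np.random.default_rng(0)
d_model, d_expert, n_experts, k = 8, 4, 6, 3  # toy sizes, chosen for illustration

# Each expert is a (W_gate, W_up, W_down) triple with random weights.
experts = [
    (rng.normal(size=(d_model, d_expert)),
     rng.normal(size=(d_model, d_expert)),
     rng.normal(size=(d_expert, d_model)))
    for _ in range(n_experts)
]

# Hypothetical importance scores (assumption): higher means more worth keeping.
scores = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])
keep = sorted(np.argsort(scores)[-k:].tolist())  # indices of the top-k experts

dense = concat_experts(experts, keep)
x = rng.normal(size=(5, d_model))

dense_out = gated_ffn(x, *dense)
sum_out = sum(gated_ffn(x, *experts[i]) for i in keep)
print(keep, dense[0].shape, np.allclose(dense_out, sum_out))
```

Because the dense layer drops the router, the kept experts are summed without gating weights. That mismatch with the original MoE output is one reason a magnitude-scaling choice and a distillation stage are natural follow-ups.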

Community

Junhyuck Kim (paper author and submitter), edited:

We introduce the first systematic framework for converting a trained MoE into a fully dense model: score, select, group, concatenate into a dense FFN, then distill. A 350-config sweep on Qwen3-30B-A3B (also DeepSeek-V2-Lite, GPT-OSS-20B) finds our novel diversity-aware scoring consistently wins. At matched params, MoE→dense beats dense→dense pruning by +6.3pp at 1.6× faster training.
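The final "distill" step trains the dense student to match the MoE teacher. The page does not state which loss is used, so the sketch below assumes one common choice: the forward KL divergence between teacher and student next-token distributions, averaged over token positions. The function names and the temperature parameter are illustrative, not taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) over token positions (assumed loss)."""
    t_logp = log_softmax(teacher_logits / temperature)
    s_logp = log_softmax(student_logits / temperature)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))  # 4 token positions, toy vocabulary of 10
student = rng.normal(size=(4, 10))

print(kd_loss(teacher, teacher))  # identical logits give zero loss
print(kd_loss(student, teacher) > 0.0)
```

In a real training loop the teacher logits would come from the frozen MoE and the gradient would flow only through the student.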

Get this paper in your agent:

hf papers read 2605.28207
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 2
