Hugging Face Daily Papers · June 9, 2026 · 3 min read

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


We introduce the first systematic framework for converting a trained MoE into a fully dense model: score, select, group, concatenate into a dense FFN, then distill. A 350-config sweep on Qwen3-30B-A3B (also DeepSeek-V2-Lite, GPT-OSS-20B) finds our novel diversity-aware scoring consistently wins. At matched params, MoE→dense beats dense→dense pruning by +6.3 pp at 1.6× faster training.


arxiv:2605.28207


Published on May 27 · Submitted by Junhyuck Kim on Jun 9 · KRAFTON

Authors: Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

Abstract

A systematic framework converts mixture-of-experts models into dense architectures through expert scoring, selection, grouping, and knowledge distillation, achieving superior performance and efficiency compared to traditional pruning methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
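The "concatenated into a dense FFN" step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes SwiGLU-style experts, ignores router weights, and omits the scoring, grouping, and magnitude scaling described above. All function and variable names are hypothetical. Under those assumptions, stacking the selected experts' hidden units yields a wider dense FFN whose output equals the unweighted sum of the selected experts' outputs.

```python
import math
import random

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, gate, up, down):
    # gate, up, down each hold one row of length d_model per hidden unit.
    h = [silu(g) * u for g, u in zip(matvec(gate, x), matvec(up, x))]
    return [sum(h[j] * down[j][i] for j in range(len(h))) for i in range(len(x))]

def concat_experts(experts):
    # Stack the hidden units of every selected expert into one wider FFN.
    return {
        name: [row for e in experts for row in e[name]]
        for name in ("gate", "up", "down")
    }

def random_expert(d_model, d_ff, rng):
    # Toy expert with random weights (illustration only).
    def mat():
        return [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
    return {"gate": mat(), "up": mat(), "down": mat()}

rng = random.Random(0)
d_model, d_ff = 4, 3
selected = [random_expert(d_model, d_ff, rng) for _ in range(2)]
dense = concat_experts(selected)
x = [rng.uniform(-1, 1) for _ in range(d_model)]

# Output of the concatenated dense FFN vs. the sum of the experts' outputs.
dense_out = swiglu_ffn(x, **dense)
sum_out = [sum(vals) for vals in zip(*(swiglu_ffn(x, **e) for e in selected))]
```

Here `dense_out` and `sum_out` agree up to floating-point rounding, and the dense hidden width is the number of selected experts times `d_ff`. Because the MoE teacher weights and routes its experts per token, a plain sum like this only approximates it, which is consistent with the abstract's use of knowledge distillation to refine the dense model.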

Code: https://github.com/krafton-ai/moe-to-dense


Get this paper in your agent:

hf papers read 2605.28207

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 2
