Hugging Face Daily Papers · June 2, 2026 · 3 min read

Confidence-Adaptive SwiGLU for Mixture-of-Experts

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


κ-SwiGLU is a confidence-aware SwiGLU variant for MoE models that uses router logits to adapt expert gate sharpness, improving pretraining performance with negligible additional parameters and small computational overhead.

arXiv:2606.00761

Published on May 30, 2026 · Submitted by adaface on Jun 2, 2026

Authors: Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu, Liangli Zhen, Yong Liu, Rick Siow Mong Goh

AI-generated summary

Confidence-Aware SwiGLU adjusts expert gate sharpness in Mixture-of-Experts models based on token-level routing confidence, improving performance with minimal computational overhead.

Abstract

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU (κ-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, κ-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate κ-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, κ-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.
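The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes the sharpness coefficient β is computed as softplus(a · router_logit + b), where `a` and `b` are hypothetical learnable scalars; the paper's actual parameterization may differ (for example, it may use per-unit parameters). All function names here are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    # Stable softplus; used here to keep the sharpness coefficient positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sharpness(router_logit: float, a: float, b: float) -> float:
    # ASSUMPTION: beta = softplus(a * logit + b). The scalars a and b are
    # hypothetical learnable parameters, not taken from the paper.
    return softplus(a * router_logit + b)


def silu_beta(x: float, beta: float) -> float:
    # SiLU/Swish with sharpness beta: x * sigmoid(beta * x).
    # beta = 1 is the standard SiLU; large beta approaches ReLU.
    return x * sigmoid(beta * x)


def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def kappa_swiglu_expert(x, W_gate, W_up, W_down, router_logit,
                        a=1.0, b=math.log(math.e - 1.0)):
    # One expert MLP for one token. The default b gives beta = 1 at a
    # router logit of 0, so the sketch starts from ordinary SwiGLU.
    beta = sharpness(router_logit, a, b)
    gate = [silu_beta(g, beta) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)


# Toy usage with identity projections (illustrative values only).
identity = [[1.0, 0.0], [0.0, 1.0]]
out = kappa_swiglu_expert([1.0, -2.0], identity, identity, [[1.0, 1.0]],
                          router_logit=0.0)
```

In this sketch, a β near 0 makes the gate nearly linear (x/2), β = 1 recovers the usual SiLU, and a large β makes the gate behave like a hard, ReLU-style selector. Whether higher routing confidence sharpens or smooths the gate depends on the learned sign of `a`, which the abstract does not specify.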


Get this paper in your agent:

hf papers read 2606.00761

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
