Hugging Face Daily Papers · 4 min read

Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.21803

Published on May 20, 2026
· Submitted by Nandan Kumar Jha on May 22, 2026
Authors: Nandan Kumar Jha, Brandon Reagen
Organization: New York University
Project page: https://optimizer-scaling-laws.github.io/

Abstract

AI-generated summary: Different optimizers produce distinct spectral scaling behaviors in Transformer models, with Muon achieving superior scaling efficiency compared to AdamW in representation capacity utilization.

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network (FFN) representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (β=0.44) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves linear scaling (β=1.02) in the same regime, a 2.3× increase in the scaling exponent. This difference is not reducible to validation loss: under extended training, AdamW configurations can match low-rank Dion variants in perplexity while exhibiting sharply different spectral geometry, demonstrating that matched loss does not imply matched representation structure. Hard-soft rank asymmetry further reveals that optimizers differ not only in how much capacity is realized, but also in how that capacity is structured across eigenmodes. To disentangle optimizer effects from architectural ones, we compare against architectural interventions (e.g., attention rank and positional encoding) and find that optimizer-induced spectral shifts often exceed the architectural effects. These results suggest optimization as a first-class axis of representation scaling, motivating optimizer-architecture co-design.
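The abstract names soft and hard spectral ranks and a scaling exponent β but does not define them on this page. The sketch below is one plausible reading, not the authors' code: it assumes soft rank is the exponential of the Shannon entropy of the normalized eigenvalues, hard rank is the participation ratio, and β is the slope of a log-log least-squares fit of rank against FFN width. The function names and the synthetic data are hypothetical, and the eigenvalues are assumed to come from the covariance of FFN activations computed elsewhere.

```python
import math


def soft_rank(eigenvalues):
    """Assumed definition: exp of the Shannon entropy of the normalized spectrum."""
    total = sum(eigenvalues)
    probs = [ev / total for ev in eigenvalues if ev > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def hard_rank(eigenvalues):
    """Assumed definition: participation ratio, (sum ev)^2 / sum(ev^2)."""
    total = sum(eigenvalues)
    return total * total / sum(ev * ev for ev in eigenvalues)


def fit_power_law(widths, ranks):
    """Fit rank ~ c * width**beta by least squares in log-log space."""
    xs = [math.log(w) for w in widths]
    ys = [math.log(r) for r in ranks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    beta = cov_xy / var_x
    c = math.exp(mean_y - beta * mean_x)
    return beta, c


if __name__ == "__main__":
    # A flat spectrum uses every direction equally; a peaked one does not.
    flat = [1.0, 1.0, 1.0, 1.0]
    peaked = [10.0, 0.1, 0.1, 0.1]
    print("flat:  ", soft_rank(flat), hard_rank(flat))
    print("peaked:", soft_rank(peaked), hard_rank(peaked))

    # Synthetic (not measured) ranks that follow rank = 3 * width**0.5.
    widths = [256, 512, 1024, 2048]
    ranks = [3.0 * w ** 0.5 for w in widths]
    print("beta, c:", fit_power_law(widths, ranks))
```

Under these assumed definitions both ranks equal the dimension for a flat spectrum and approach 1 when a single eigenvalue dominates, which is why a gap between them says something about how the spectrum is shaped rather than only how large it is.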

Community

Paper submitter about 10 hours ago

Same Architecture, Different Optimizer, Different Capacity: In this work, we show that realized representation capacity is not determined by architecture alone; it emerges from the architecture-optimizer interaction. Optimizer geometry changes the scaling exponents that govern how FFN width is converted into usable capacity. Further, through controlled comparisons, we found that optimizer-induced shifts in spectral scaling often exceed the shifts caused by architectural interventions.
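A short worked calculation shows what a change in exponent means in practice. It assumes the power-law form suggested by the term "scaling exponent"; the symbols R (realized spectral rank), d (FFN width), and c (a fitted constant) are notation introduced here for illustration, not taken from the paper. The β values are the TAIL hard-rank exponents quoted in the abstract.

```latex
\begin{align*}
R(d) &\approx c\, d^{\beta}
  &&\Longrightarrow&
  \frac{R(2d)}{R(d)} &= 2^{\beta} \\
\text{AdamW: } 2^{0.44} &\approx 1.36,
  &&&
  \text{Muon: } 2^{1.02} &\approx 2.03 \\
\frac{\beta_{\text{Muon}}}{\beta_{\text{AdamW}}} &= \frac{1.02}{0.44} \approx 2.3
\end{align*}
```

Under this assumed form, doubling the FFN width raises the realized hard rank on TAIL representations by about 36% with AdamW, while with Muon it roughly doubles.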

Get this paper in your agent:

hf papers read 2605.21803
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
