Hugging Face Daily Papers · May 29, 2026 · 7 min read

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2605.29548


Published on May 28 · Submitted by taesiri on May 29


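Authors: Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis, Naomi Saphra, David Alvarez-Melis, Andrew Kyle Lampinen, Christopher Potts, Ekdeep Singh Lubana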

Abstract

AI-generated summary: Larger models outperform smaller ones on complex and rare tasks due to reduced gradient interference and better resource allocation, enabling them to learn task features that smaller models miss even with infinite data.

Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high-frequency or low-complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck, finding that it traces to a reduced interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, which means that they do not overwrite rare-task features as they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic-data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to learn. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.
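The interference mechanism in the abstract can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not the authors' code or metric: it assumes a shared parameter vector, a squared-error linear model for each task, and plain gradient descent. The function names (`task_gradient`, `gradient_cosine`, `rare_loss_change`) are hypothetical. It shows two quantities one might inspect: the cosine similarity between a common-task gradient and a rare-task gradient, and the first-order change in the rare-task loss caused by one step on the common task.

```python
import numpy as np


def task_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w (assumed toy loss)."""
    residual = X @ w - y
    return X.T @ residual / len(y)


def gradient_cosine(g_a, g_b, eps=1e-12):
    """Cosine similarity of two gradients; negative values indicate conflicting updates."""
    return float(g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b) + eps))


def rare_loss_change(g_common, g_rare, lr):
    """First-order change in the rare-task loss after the step w <- w - lr * g_common.

    By a first-order Taylor expansion this is -lr * (g_rare @ g_common).
    A positive value means the common-task update raises the rare-task loss.
    """
    return float(-lr * (g_rare @ g_common))


if __name__ == "__main__":
    g_rare = np.array([-1.0, 1.0])
    strong_common = np.array([1.0, 0.0])   # common task still far from fitted
    weak_common = 0.1 * strong_common      # common task nearly fitted: weak updates

    print("cosine:", gradient_cosine(strong_common, g_rare))
    print("harm (strong common gradient):", rare_loss_change(strong_common, g_rare, lr=0.1))
    print("harm (weak common gradient):", rare_loss_change(weak_common, g_rare, lr=0.1))
```

In this toy setting the direction of conflict (the cosine) is the same in both cases, but the damage to the rare task scales with the magnitude of the common-task gradient. That is consistent with the abstract's description of weak common-task updates no longer overwriting rare-task features, although the paper's own measurements may be defined differently.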


Community

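taesiri (paper submitter) · 1 day ago

This paper provides a data-centric account of why larger models learn tasks that smaller models fail to learn, attributing this to reduced gradient interference and more efficient resource allocation for rare tasks.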

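librarian-bot · about 13 hours ago

This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* What do Language Models Learn and When? The Implicit Curriculum Hypothesis (2026), https://huggingface.co/papers/2604.08510
* Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise (2026), https://huggingface.co/papers/2605.18022
* Understanding Generalization and Forgetting in In-Context Continual Learning (2026), https://huggingface.co/papers/2605.28705
* Law of Neural Interaction: Depth-Width Shape, Interaction Efficiency, and Generalization (2026), https://huggingface.co/papers/2605.27989
* Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts (2026), https://huggingface.co/papers/2604.08519
* Slower Generalization, Faster Memorization: A Sweet Spot in Algorithmic Learning (2026), https://huggingface.co/papers/2605.14659
* The Power of Power Law: Asymmetry Enables Compositional Reasoning (2026), https://huggingface.co/papers/2604.22951

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend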


Get this paper in your agent:

hf papers read 2605.29548

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

