Hugging Face Daily Papers · 5 min read

Language Models Need Sleep

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.26099

Published on May 25, 2026 · Submitted by Sean McLeish on May 26, 2026
Authors: Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti

Abstract

AI-generated summary: A sleep-like consolidation mechanism for transformer models uses fast weights and recurrent passes to improve long-context processing while maintaining inference speed.

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs N offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration N for our models improves performance, with the largest gains on examples that require deeper reasoning.
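
To make the wake/sleep split described above concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. Every name in it (SleepyMemory, observe, sleep, recall, lr) is hypothetical, the list of (key, value) pairs only stands in for a key-value cache, and a fixed delta rule is swapped in for the paper's learned local rule. The only parts taken from the abstract are the overall shape: accumulate context while awake, run N offline passes that write it into persistent fast weights, then clear the cache.

```python
# Toy illustration only. All names are hypothetical, and the fixed delta rule
# below is a stand-in for the paper's *learned* local update rule.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class SleepyMemory:
    def __init__(self, dim, lr=0.5):
        self.dim = dim
        self.lr = lr  # assumed step size for the stand-in delta rule
        # Fast weights persist across sleep phases.
        self.fast_weights = [[0.0] * dim for _ in range(dim)]
        # Stand-in for the key-value cache: recent (key, value) pairs.
        self.cache = []

    def observe(self, key, value):
        """Wake phase: just append to the cache, no weight update."""
        self.cache.append((key, value))

    def sleep(self, n_passes):
        """Sleep phase: n_passes offline passes over the cached context,
        each applying W += lr * (value - W @ key) key^T, then clear the cache."""
        for _ in range(n_passes):
            for key, value in self.cache:
                pred = matvec(self.fast_weights, key)
                for i in range(self.dim):
                    err = value[i] - pred[i]
                    for j in range(self.dim):
                        self.fast_weights[i][j] += self.lr * err * key[j]
        self.cache.clear()

    def recall(self, key):
        """Wake-time read: one matrix-vector product, independent of how
        much context was consolidated or how long the sleep phase was."""
        return matvec(self.fast_weights, key)


def recall_error(n_passes):
    """Consolidate two one-hot associations, then measure recall error."""
    mem = SleepyMemory(dim=2)
    pairs = [([1.0, 0.0], [0.0, 1.0]), ([0.0, 1.0], [1.0, 0.0])]
    for key, value in pairs:
        mem.observe(key, value)
    mem.sleep(n_passes)
    return sum(
        abs(v - r)
        for key, value in pairs
        for v, r in zip(value, mem.recall(key))
    )


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, recall_error(n))
```

In this toy, with unit-norm orthogonal keys and lr = 0.5, each extra sleep pass halves the remaining recall error while the cost of recall stays the same. That is a property of the stand-in delta rule and should not be read as a reproduction of the paper's results.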

Community

Do you think this could be used to help LLMs understand new words introduced to them at inference time? For example, if obscure medical terms the model was never trained on are explained using words it does know, could this help it understand those terms more deeply? If LLMs already learn by looking at relationships and storing those relationships in their weights, then it would make sense that using fast weights in this way might help.

Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/language-models-need-sleep-4798-2d8affbd
Covers the executive summary, detailed methodology, and practical applications.

Get this paper in your agent:

hf papers read 2605.26099
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash