Hugging Face Daily Papers · 4 min read

HRM-Text: Efficient Pretraining Beyond Scaling

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arXiv:2605.20613

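Published on May 20, 2026 · Submitted by One on May 21, 2026
Authors: Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori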

AI-generated summary

A Hierarchical Recurrent Model architecture with specialized training on instruction-response pairs achieves competitive language modeling performance with significantly reduced computational requirements compared to traditional Transformer-based approaches.

Abstract

The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and a $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.
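
To make the PrefixLM masking mentioned in the abstract concrete, here is a minimal sketch of the general technique, not the authors' implementation. It assumes the instruction is the prefix (attended to bidirectionally) and the response is decoded causally. It also assumes the task-completion objective scores only response tokens, which the abstract does not spell out. The function names `prefix_lm_mask` and `response_loss_mask` are hypothetical.

```python
def prefix_lm_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Prefix (instruction) positions see the whole prefix; response
    positions see the prefix plus every earlier response position.
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie in [0, total_len]")
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


def response_loss_mask(prefix_len: int, total_len: int) -> list[bool]:
    """True for tokens whose prediction would be scored.

    Assumption: only response tokens contribute to the loss.
    """
    return [i >= prefix_len for i in range(total_len)]


# Hypothetical toy sequence: 3 instruction tokens followed by 2 response tokens.
for row in prefix_lm_mask(3, 5):
    print("".join("1" if allowed else "." for allowed in row))
print(response_loss_mask(3, 5))
```

In the printed grid, the first three rows are identical because every instruction token sees the full instruction, while the last two rows grow one column at a time, as in an ordinary causal mask.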

Community

Paper submitter about 10 hours ago

HRM-Text explores a different approach to language model pretraining: hierarchical recurrent computation, task-completion training, and latent-space reasoning.

At just 1B parameters, HRM-Text achieves competitive performance with dramatically lower training cost and data requirements.

1B parameters
40B unique tokens
~1 day of pretraining
~$1000 training cost
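
The "hierarchical recurrent computation" mentioned above can be pictured as a two-timescale loop: a fast state updates several times while a slow state is held fixed, then the slow state updates once. The sketch below is a toy illustration of that idea only. It uses scalar states and arbitrary placeholder weights, and the function name `hierarchical_recurrence` and its parameters are hypothetical. It does not reproduce HRM-Text, MagicNorm, or the paper's credit-assignment scheme.

```python
import math


def hierarchical_recurrence(x: float, n_cycles: int = 2, fast_steps: int = 3):
    """Toy two-timescale recurrence; returns (slow, fast, fast_updates).

    All coefficients below are placeholders chosen for illustration,
    not values taken from the paper.
    """
    slow, fast = 0.0, 0.0
    fast_updates = 0
    for _ in range(n_cycles):
        # Fast "execution" state: several updates under a fixed slow state.
        for _ in range(fast_steps):
            fast = math.tanh(0.5 * fast + 0.3 * slow + x)
            fast_updates += 1
        # Slow "strategic" state: one update per cycle, reading the fast result.
        slow = math.tanh(0.5 * slow + 0.5 * fast)
    return slow, fast, fast_updates


slow, fast, n = hierarchical_recurrence(0.2, n_cycles=2, fast_steps=3)
print(n)  # number of fast updates: n_cycles * fast_steps
```

The point of the nesting is that the effective computation depth is `n_cycles * fast_steps` updates while the same small set of weights is reused at every step.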

Get this paper in your agent:

hf papers read 2605.20613
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
