Hugging Face Daily Papers · June 23, 2026 · 4 min read

Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

As high-quality text grows scarce, language model pretraining is entering a data-constrained, compute-abundant regime that requires many epochs over a fixed corpus. In this setting, standard autoregressive (AR) training overfits and eventually degrades. This work aims to demystify training-time data augmentation as a remedy, systematically separating the augmented training views that regularize many-epoch AR training from those that fail or interfere.

arxiv:2606.16246

Published on Jun 19

Submitted by Zhen Wang on Jun 23

University of California at San Diego

Authors: Michael K. Chen, Xikun Zhang, Fan Bai, Zhengding Hu, Zhen Wang

Abstract

Training-time data augmentation techniques help mitigate overfitting in autoregressive language model pretraining by delaying performance deterioration and improving final model quality when training on fixed datasets for many epochs.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate training-time data augmentation as a regularizer to mitigate this overfitting and enable productive training for hundreds of epochs on the same data. We introduce three orthogonal categories of augmentation for AR pretraining: token-level noise (masking, random replacement), sequence permutations (right-to-left prediction, Fill-in-the-Middle), and target offset prediction (x_{t+i} for i > 1). Through systematic ablations, we find that individual augmentations delay overfitting and lower validation loss relative to the baseline, with random token replacement achieving the best minimum loss among individual methods. Combining augmentation categories further lowers the minimum validation loss. Our experiments demonstrate that data augmentations mitigate AR pretraining's data inefficiency and offer a promising solution to the data-constrained regime. All code and data are available at https://github.com/michaelchen-lab/data-augmentations-for-pretraining.
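The three augmentation categories named in the abstract can be illustrated with a small sketch over lists of token ids. This is not the authors' implementation (see their repository for that). The function names, the sentinel ids for Fill-in-the-Middle, the prefix-suffix-middle ordering, and the uniform choice of replacement tokens are all assumptions made for illustration. The sketch only builds the augmented views; how the loss is computed on them is left open.

```python
import random

# Hypothetical helpers for illustration; names and details are assumptions,
# not taken from the paper's code.

def random_replace(tokens, vocab_size, p, rng):
    """Token-level noise: with probability p, swap a token for a uniformly random id."""
    return [rng.randrange(vocab_size) if rng.random() < p else t for t in tokens]

def mask_tokens(tokens, mask_id, p, rng):
    """Token-level noise: with probability p, swap a token for a mask id."""
    return [mask_id if rng.random() < p else t for t in tokens]

def right_to_left(tokens):
    """Sequence permutation: reverse the sequence so the model predicts right to left."""
    return tokens[::-1]

def fill_in_the_middle(tokens, pre_id, suf_id, mid_id, rng):
    """Sequence permutation: cut into prefix/middle/suffix and move the middle last.

    Assumes a prefix-suffix-middle layout with three sentinel ids.
    """
    i, j = sorted(rng.sample(range(len(tokens) + 1), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [pre_id] + prefix + [suf_id] + suffix + [mid_id] + middle

def offset_pairs(tokens, offset):
    """Target offset: pair x_t with x_{t+offset}; offset=1 is ordinary next-token prediction."""
    return list(zip(tokens[:-offset], tokens[offset:]))

if __name__ == "__main__":
    rng = random.Random(0)
    toks = list(range(10, 20))
    print(random_replace(toks, vocab_size=100, p=0.2, rng=rng))
    print(fill_in_the_middle(toks, pre_id=0, suf_id=1, mid_id=2, rng=rng))
    print(offset_pairs(toks, offset=2))
```

Because each function returns a new list, views from different categories can be composed (for example, applying `random_replace` to the output of `fill_in_the_middle`), which mirrors the abstract's point about combining categories, though the specific combinations the paper evaluates are not stated here.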


