AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining
Abstract
AC-ODM optimizes pretraining data composition for LLMs using reinforcement learning to improve convergence speed and downstream accuracy while maintaining computational efficiency.
Optimizing pretraining data composition is pivotal for LLM generalization. While dynamic mixing outperforms static strategies by capturing evolving training dynamics, current methods fail to reconcile computational efficiency with sample efficiency and structural flexibility for diverse pipelines. We introduce Actor-Critic Online Data Mixing (AC-ODM), which approaches data mixing from a reinforcement learning perspective with a parameterized policy that we theoretically prove to act as a dynamic linear surrogate maximizing the constructive interference of gradients. To enhance practical flexibility, AC-ODM supports two operational modes: (i) a proxy mode for fixed, pre-prepared corpora, where a policy learned on a small model is transferred to a larger target; and (ii) a non-proxy mode for direct end-to-end training from scratch without priors. Empirically, AC-ODM significantly outperforms prior methods in convergence speed and downstream accuracy across various architectures. On Pythia-1B, it reaches optimal validation perplexity using up to 66% fewer training steps than competitive baselines, delivering a 27.5% relative improvement in MMLU accuracy and a 2.23× higher pass@1 on HumanEval, while incurring only a 0.4% per-step wall-clock increase and 2% additional memory overhead. Code is available at https://github.com/DANG-ai/AC-ODM.
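To make the idea of a learned mixing policy concrete, here is a minimal toy sketch. It is not the authors' implementation: the function names, the fixed per-domain gradient vectors, the alignment reward (dot product with the mean gradient), and the scalar running-average critic are all assumptions chosen for illustration. In real training the gradients change at every step, and the actual state, reward, and critic design should be taken from the paper and repository.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_rewards(domain_grads):
    """Toy reward per domain: dot product of its gradient with the mean gradient.

    A positive value means the domain's update direction agrees with the
    average direction (constructive interference); negative means it conflicts.
    """
    dim = len(domain_grads[0])
    n = len(domain_grads)
    mean = [sum(g[d] for g in domain_grads) / n for d in range(dim)]
    return [sum(a * b for a, b in zip(g, mean)) for g in domain_grads]


def train_mixing_policy(domain_grads, steps=2000, actor_lr=0.5,
                        critic_lr=0.1, seed=0):
    """Bandit-style actor-critic over data domains (illustrative only).

    Actor: softmax policy over domains, parameterized by one logit each.
    Critic: a single scalar baseline tracking the expected reward.
    """
    rng = random.Random(seed)
    k = len(domain_grads)
    logits = [0.0] * k
    baseline = 0.0
    rewards = alignment_rewards(domain_grads)  # fixed here; dynamic in practice
    for _ in range(steps):
        probs = softmax(logits)
        i = rng.choices(range(k), weights=probs)[0]  # sample a domain
        advantage = rewards[i] - baseline
        # Policy-gradient step: d log pi(i) / d logit_j = 1[j == i] - probs[j]
        for j in range(k):
            indicator = 1.0 if j == i else 0.0
            logits[j] += actor_lr * advantage * (indicator - probs[j])
        # Critic step: move the baseline toward the observed reward
        baseline += critic_lr * (rewards[i] - baseline)
    return softmax(logits)


# Hypothetical 2-D gradients for three data domains.
TOY_GRADS = [[1.0, 0.5], [0.2, 0.1], [-0.6, 0.3]]
mixture = train_mixing_policy(TOY_GRADS)
```

Here `mixture` is the learned sampling distribution over the three toy domains; the policy shifts probability mass toward the domain whose gradient agrees most with the average direction.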
Community
ICML 2026 regular paper. This paper presents a reinforcement learning method for data mixing during the pre-training stage of LLMs. In the best case, it reduces actual pre-training time by 60% without compromising the performance of the pre-trained LLM. The code is available at https://github.com/DANG-ai/AC-ODM.
Cite arxiv.org/abs/2505.23878 in a model, dataset, or Space README.md to link it from this page.