Hugging Face Daily Papers · May 20, 2026 · 12 min read

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.08472

Published on May 8 · Submitted by Aswin Ravikumar Rangsasamy Veerasamy on May 20 · Arizona State University

Authors: Aswin RRV, Jacob Dineen, Divij Handa, Mihir Parmar, Ben Zhou, Swaroop Mishra, Chitta Baral

Abstract

AI-generated summary: Using diverse self-generated data during mid-training based on Polya's problem-solving approaches improves reinforcement learning performance in language models across mathematical reasoning and out-of-distribution tasks.

The effectiveness of Reinforcement Learning (RL) in Large Language Models (LLMs) depends on the nature and diversity of the data used before and during RL. In particular, reasoning problems can often be approached in multiple ways that rely on different forms of reasoning, and exposure to only a limited range of such approaches in the training data may limit the effectiveness of RL. Motivated by this, we investigate using diverse self-generated data during mid-training as an intermediate step before RL training. Specifically, we adopt a bootstrapped data-generation framework guided by George Polya's problem-solving approaches for generating multiple variants of correct answers for each question in the training data, and then perform fine-tuning. We first provide a theoretical perspective on how mid-training on such data improves RL and explain how policy-gradient updates can incentivize combining multiple approaches. We then empirically demonstrate that RL-trained models initialized with our mid-training data achieve consistent improvements across various mathematical reasoning benchmarks and other OOD tasks like code generation and narrative reasoning. Overall, our investigative study shows that a language model learning multiple problem-solving approaches through self-generated data helps subsequent RL.

Community

rrvaswin

Paper submitter about 7 hours ago

Excited to share our new paper 🚀

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
We study a simple question: Can we make RL more effective by first teaching models multiple correct ways to solve the same problem?
Instead of reinforcing a single reasoning trajectory, can we expose the model to a richer space of valid approaches before RL begins?

Our investigation is simple.
Before RL, we mid-train the model on multiple correct ways of solving the same problem, so that when RL begins, it operates over a richer set of priors rather than a single narrow reasoning mode.
Importantly, these reasoning traces are self-generated by the same base model that is later trained with RL. No human-written chains of thought, and no distillation from a stronger teacher model.

To make the solutions diverse, we use problem-solving heuristics inspired by George Pólya's How to Solve It.
For each question, the model is prompted to solve it using different approaches: analogy, working backward, decomposition, introducing auxiliary elements, logical step-by-step justification, bright ideas, and more.
This gives us structurally distinct reasoning traces for the same underlying problem.
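
The heuristic-guided prompting described above might look roughly like the following. This is a minimal sketch, not the paper's prompts: the dictionary `HEURISTICS`, the function `build_prompts`, and all instruction wording are hypothetical, and only the heuristic names come from the post.

```python
# Hypothetical sketch: the prompt wording below is illustrative, not taken from the paper.
HEURISTICS = {
    "analogy": "Solve it by relating it to a similar problem you already know how to solve.",
    "working_backward": "Start from the desired result and work backward to the given data.",
    "decomposition": "Break the problem into smaller subproblems and solve each in turn.",
    "auxiliary_elements": "Introduce an auxiliary element, such as a new variable, to help.",
}


def build_prompts(question: str, heuristics: dict = HEURISTICS) -> dict:
    """Return one prompt per heuristic, all for the same question."""
    return {
        name: (
            f"Problem: {question}\n"
            f"Approach: {instruction}\n"
            "Show your reasoning, then give the final answer."
        )
        for name, instruction in heuristics.items()
    }


prompts = build_prompts("What is the sum of the first 10 positive odd integers?")
for name, prompt in prompts.items():
    print(f"[{name}]\n{prompt}\n")
```

Each prompt would then be sent to the same base model that is later trained with RL, so every question yields one candidate trace per heuristic.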

The generated solutions are filtered in two steps.
First, rule-based verification keeps only responses with the correct final answer.
Then, a reward model scores how well the response follows the intended heuristic.
The highest-scoring correct response per (question, heuristic) pair is selected, giving us multiple correct, heuristic-specific solution traces per question. 🧠
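
The two-step filter above can be sketched as follows. The `Candidate` record, the exact-match answer check, and the `adherence` field (standing in for the reward model's heuristic-following score) are assumptions made for illustration; the post does not specify the verifier or the reward model.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    question_id: str
    heuristic: str
    response: str
    final_answer: str
    adherence: float  # hypothetical reward-model score for following the heuristic


def select_traces(candidates, gold_answers):
    """Keep correct responses, then pick the best one per (question, heuristic)."""
    best = {}
    for cand in candidates:
        # Step 1: rule-based verification (assumed here to be an exact string match).
        if cand.final_answer.strip() != gold_answers[cand.question_id].strip():
            continue
        # Step 2: keep the highest-scoring correct response for this pair.
        key = (cand.question_id, cand.heuristic)
        if key not in best or cand.adherence > best[key].adherence:
            best[key] = cand
    return best


gold = {"q1": "100"}
pool = [
    Candidate("q1", "analogy", "trace A", "100", 0.62),
    Candidate("q1", "analogy", "trace B", "100", 0.91),
    Candidate("q1", "analogy", "trace C", "99", 0.99),  # wrong answer, dropped
    Candidate("q1", "decomposition", "trace D", "100", 0.40),
]
selected = select_traces(pool, gold)
for (qid, heuristic), cand in selected.items():
    print(qid, heuristic, cand.response, cand.adherence)
```

Filtering on correctness first means a high adherence score can never rescue an incorrect trace, which matches the order of the two steps in the post.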

Why should this help RL?
Our theoretical view: mid-training on n correct approaches creates multiple high-probability continuations at reasoning branch points, i.e., an n-modal distribution.
Under a positive gradient, RL can meaningfully update across all n modes rather than sharpening a single one. Under a negative gradient, mass removed from the sampled approach redistributes to the remaining n-1 dominant modes, i.e., to the other valid approaches the model knows.
This is the mechanism by which RL learns to combine the approaches introduced during mid-training.
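
The negative-gradient claim can be illustrated with a toy example. The sketch below is not the paper's analysis: it assumes a single branch point modeled as a softmax over five continuations (three dominant "approaches" and two unlikely ones) and applies one REINFORCE-style logit update, using the standard identity that the gradient of log π(a) with respect to logit j is 1[j = a] − π(j). The logits, learning rate, and advantage are made-up values.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, sampled, advantage, lr=1.0):
    """One logit update: z_j += lr * advantage * (1[j == sampled] - pi_j)."""
    probs = softmax(logits)
    return [
        z + lr * advantage * ((1.0 if j == sampled else 0.0) - p)
        for j, (z, p) in enumerate(zip(logits, probs))
    ]


# Toy branch point: three dominant modes and two low-probability continuations.
logits = [2.0, 2.0, 2.0, -2.0, -2.0]
before = softmax(logits)
# The sampled approach (index 0) receives a negative advantage.
after = softmax(policy_gradient_step(logits, sampled=0, advantage=-1.0))

for j, (b, a) in enumerate(zip(before, after)):
    print(f"continuation {j}: {b:.4f} -> {a:.4f}")
```

Because each unsampled logit rises in proportion to its current probability, the probability taken from the penalized mode flows mainly to the other dominant modes rather than to the tail.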

Empirically, this improves GRPO-based RL.
On Llama-3.2-3B-Instruct, models initialized with our heuristic-guided mid-training consistently outperform vanilla RL and STaR+RL across six math benchmarks, with gains becoming clearer at larger pass@k.
At pass@64, the average improves from 44.21 for vanilla RL to 48.09 with n=16. 📊
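
For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator commonly used in code-generation evaluation: with n samples per problem of which c are correct, pass@k = 1 − C(n−c, k) / C(n, k). Whether the paper uses exactly this estimator is an assumption; the post does not say. The last lines only redo the arithmetic on the two averages quoted above.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n is correct,
    given that c of the n samples are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Gap between the two pass@64 averages quoted in the post.
vanilla_rl, mid_trained = 44.21, 48.09
absolute_gain = mid_trained - vanilla_rl
relative_gain = absolute_gain / vanilla_rl
print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative gain: {relative_gain:.1%}")
```

Larger k rewards a model that keeps several distinct working approaches alive, which is consistent with the post's observation that gains become clearer at larger pass@k.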

One of our most interesting findings: RL doesn't just use the individual approaches from mid-training. It composes them.
We analyze reasoning traces using an LLM-based classifier across 64 Pólya-style heuristics. At n=16, RL-trained models combine multiple problem-solving approaches in 56.7% of chains, vs. only 23.3% before RL. This composition rate grows as n increases.
Combinations like Bolzano + Decompose or Restate + Decompose + Carry-Out emerge consistently after RL, even though they were never observed together during mid-training. RL is doing the composition. 🔗
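
The composition rate can be read as the fraction of reasoning chains to which the classifier assigns two or more distinct heuristics. The sketch below assumes the classifier output is a list of heuristic labels per chain; that format, the function name, and the example chains are hypothetical, and only the label names are taken from the post.

```python
def composition_rate(chain_labels):
    """Fraction of chains labeled with at least two distinct heuristics."""
    if not chain_labels:
        raise ValueError("need at least one chain")
    composed = sum(1 for labels in chain_labels if len(set(labels)) >= 2)
    return composed / len(chain_labels)


# Made-up classifier output for four reasoning chains.
chains = [
    ["Bolzano", "Decompose"],
    ["Restate", "Decompose", "Carry-Out"],
    ["Decompose"],
    ["Decompose", "Decompose"],  # the same heuristic twice is not a composition
]
print(f"composition rate: {composition_rate(chains):.1%}")
```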

Four additional findings from our analysis:
Under a fixed instance-level budget (16 × 463 = 7,408 training instances in both settings), 16 approaches on 463 questions outperform 1 approach on 7,408 questions, a relative improvement of around 7% after RL. During mid-training, learning more problem-solving approaches is therefore more beneficial than learning to solve more problems.

Correctness vs. diversity. Diverse but incorrect reasoning traces fall below vanilla RL, and performance worsens further as the number of incorrect problem-solving approaches grows. Diversity alone is not enough; correctness is pivotal.

More diverse than distillation. Our self-generated data scores Vendi 13.81 vs. 10.95 for QwQ-32B distillation, and gives better post-RL performance despite coming from a much weaker model.

Generalizes beyond math. Despite math-centric heuristics, gains on HumanEval (code) and MuSR (narrative reasoning) show that Pólya's problem-solving approaches transfer.
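
For the diversity comparison among the findings above, the Vendi Score is the exponential of the Shannon entropy of the eigenvalues of K/n, where K is an n × n positive semidefinite similarity matrix with ones on the diagonal. It ranges from 1 (all items identical) to n (all items fully dissimilar). The sketch below is a dependency-free illustration on tiny matrices; the similarity kernel the authors used is not stated in the post, and the helper `symmetric_eigenvalues` is a simple Jacobi routine written for this example, not a production eigensolver.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    n = len(matrix)
    a = [list(row) for row in matrix]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < 1e-20:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return [a[i][i] for i in range(n)]


def vendi_score(kernel):
    """exp(Shannon entropy of the eigenvalues of kernel / n)."""
    n = len(kernel)
    scaled = [[value / n for value in row] for row in kernel]
    eigenvalues = symmetric_eigenvalues(scaled)
    entropy = -sum(lam * math.log(lam) for lam in eigenvalues if lam > 1e-12)
    return math.exp(entropy)


identical = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
distinct = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
partial = [[1.0, 0.5], [0.5, 1.0]]
for name, kernel in [("identical", identical), ("distinct", distinct), ("partial", partial)]:
    print(name, round(vendi_score(kernel), 4))
```

The score can be read as an effective number of distinct items, which is why a higher value for the self-generated data indicates a more varied set of traces.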

Takeaway:
RL performance depends not only on the RL stage itself, but also on the distribution the model is exposed to beforehand.
Mid-training on diverse, self-generated, correct reasoning traces improves subsequent RL, and the effect is driven by RL learning to compose the approaches introduced during mid-training.

librarian-bot

15 minutes ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

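The following papers were recommended by the Semantic Scholar API

* When Can LLMs Learn to Reason with Weak Supervision? (2026), https://huggingface.co/papers/2604.18574
* GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero (2026), https://huggingface.co/papers/2605.15464
* Exploration-Driven Optimization for Test-Time Large Language Model Reasoning (2026), https://huggingface.co/papers/2605.09853
* SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models (2026), https://huggingface.co/papers/2604.16995
* Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision (2026), https://huggingface.co/papers/2604.12002
* Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning (2026), https://huggingface.co/papers/2605.06241
* Why Does Reinforcement Learning Generalize? A Feature-Level Mechanistic Study of Post-Training in Large Language Models (2026), https://huggingface.co/papers/2604.25011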

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
