Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Mengjie Ren, Jie Lou, Boxi Cao, Xueru Wen, Hongyu Lin, Xianpei Han, Le Sun, Xing Yu, Yaojie Lu
Published 2026-05-14 · arXiv:2605.14539
Abstract
AI-generated summary: Correction-Oriented Policy Optimization extends reinforcement learning with verifiable rewards by converting failed trajectories into correction supervision, improving reasoning capabilities and error correction in language models.
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model's own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model's ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model's intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
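The abstract does not spell out how failed trajectories become correction samples, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a binary verifier, a hypothetical `CORRECTION_TEMPLATE`, and hypothetical helper names (`Rollout`, `score_rollouts`, `build_correction_prompts`). It also includes the commonly used unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k) for n samples of which c are correct; the paper may compute pass@K differently.

```python
from dataclasses import dataclass
from math import comb
from typing import Callable, List


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float  # binary verifiable reward: 1.0 if correct, else 0.0


# Hypothetical template: the paper's actual correction prompt is not given here.
CORRECTION_TEMPLATE = (
    "{prompt}\n\n"
    "A previous attempt was incorrect:\n{failed}\n\n"
    "Identify the error and give a corrected solution."
)


def score_rollouts(
    prompt: str,
    responses: List[str],
    verifier: Callable[[str, str], bool],
) -> List[Rollout]:
    """Attach a binary reward to each on-policy response."""
    return [
        Rollout(prompt, r, 1.0 if verifier(prompt, r) else 0.0)
        for r in responses
    ]


def build_correction_prompts(rollouts: List[Rollout]) -> List[str]:
    """Turn each failed rollout into a new prompt asking for a correction."""
    return [
        CORRECTION_TEMPLATE.format(prompt=r.prompt, failed=r.response)
        for r in rollouts
        if r.reward == 0.0
    ]


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy usage: the verifier here is an assumption made for illustration only.
def toy_verifier(prompt: str, response: str) -> bool:
    return response.strip().endswith("4")


rollouts = score_rollouts(
    "What is 2 + 2?",
    ["The answer is 5", "The answer is 4"],
    toy_verifier,
)
correction_prompts = build_correction_prompts(rollouts)
```

In a training loop, responses sampled for `correction_prompts` would presumably be scored by the same verifier and optimized together with the original rollouts. That joint objective is not specified in the abstract and is left out of this sketch.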