Hugging Face Daily Papers · 4 min read

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.15726

Published on May 15 · Submitted by Minki Kang on May 18
Authors: Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang

Abstract

AI-generated summary: The NudgeRL framework enhances reinforcement learning with verifiable rewards through structured exploration and strategy nudging, improving the reasoning capabilities of large language models.

Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on a lightweight, strategy-level context to induce diverse reasoning trajectories without relying on expensive oracle supervision. To learn effectively from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO even when GRPO is given up to 8 times larger rollout budgets, and it also outperforms an oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.
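To make the idea of conditioning each rollout on a strategy-level context more concrete, here is a minimal Python sketch. It is not taken from the NudgeRL repository: the strategy hints, the prompt template, and the function name `build_nudged_prompts` are all hypothetical, chosen only to illustrate how a rollout budget might be split across contexts.

```python
from typing import Dict, List

# Hypothetical strategy hints. The abstract does not list the actual
# strategy-level contexts, so these are illustrative placeholders only.
STRATEGIES: List[str] = [
    "Work backwards from the goal.",
    "Try small cases and look for a pattern.",
    "Look for a symmetry or an invariant.",
    "Set up equations and solve them algebraically.",
]


def build_nudged_prompts(
    question: str,
    strategies: List[str],
    rollouts_per_strategy: int,
) -> List[Dict[str, object]]:
    """Return one prompt per rollout, each tagged with its context id.

    The total number of rollouts is len(strategies) * rollouts_per_strategy,
    so the budget is spread across contexts instead of being spent on
    repeated samples from a single unconditioned prompt.
    """
    prompts: List[Dict[str, object]] = []
    for context_id, hint in enumerate(strategies):
        for _ in range(rollouts_per_strategy):
            prompts.append(
                {
                    "context_id": context_id,
                    # Assumed template: the hint is simply appended to the question.
                    "prompt": f"{question}\n\nStrategy hint: {hint}",
                }
            )
    return prompts


if __name__ == "__main__":
    batch = build_nudged_prompts("Find all integers n with n^2 = 2n.", STRATEGIES, 2)
    for item in batch:
        print(item["context_id"], repr(item["prompt"]))
```

Keeping the `context_id` alongside each prompt matters for the learning step, since an objective that separates inter- and intra-context signal needs to know which rollouts shared a context.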

Community

Paper submitter about 24 hours ago

We introduce NudgeRL, a method that enables the model to generate diverse reasoning paths during rollout through strategy nudging, thereby improving exploration. To effectively learn from the exploration induced by nudging, we further introduce the inter-intra group advantage.
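The page does not give the formula for the inter-intra group advantage, so the following Python sketch shows only one plausible reading, under stated assumptions: rollouts are grouped by the context they were nudged with, the intra-context term compares a reward with its own context's mean, and the inter-context term compares that context's mean with the mean over all rollouts. The function name and the weights `w_inter` and `w_intra` are hypothetical, and the paper's actual objective (including any normalization or the distillation term) may differ.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def inter_intra_advantages(
    rewards: List[float],
    context_ids: List[int],
    w_inter: float = 1.0,
    w_intra: float = 1.0,
) -> List[float]:
    """Assumed decomposition of a mean-centered group advantage.

    intra = reward - mean reward of the rollout's own context
    inter = mean reward of that context - mean reward over all rollouts
    """
    if len(rewards) != len(context_ids):
        raise ValueError("rewards and context_ids must have the same length")

    # Collect rewards per context so each context gets its own baseline.
    by_context: Dict[int, List[float]] = defaultdict(list)
    for reward, context_id in zip(rewards, context_ids):
        by_context[context_id].append(reward)

    context_mean = {c: mean(rs) for c, rs in by_context.items()}
    global_mean = mean(rewards)

    advantages: List[float] = []
    for reward, context_id in zip(rewards, context_ids):
        inter = context_mean[context_id] - global_mean
        intra = reward - context_mean[context_id]
        advantages.append(w_inter * inter + w_intra * intra)
    return advantages


if __name__ == "__main__":
    # Four rollouts, two contexts, binary verifiable rewards.
    print(inter_intra_advantages([1.0, 0.0, 0.0, 0.0], [0, 0, 1, 1]))
```

With equal weights the two terms sum to the reward minus the global mean, which is the GRPO group advantage before its standard-deviation normalization. Changing the weights is one way such a decomposition could emphasize which strategy worked versus which rollout worked within a strategy.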

