Hugging Face Daily Papers · 4 min read

Not only where, But when: Temporal Scheduling for RLVR

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Papers
arxiv:2605.25381

Published on May 25, 2026 · Submitted by SII-Jhao Zhang (JingHaoZ) on Jun 2, 2026
Authors: Jinghao Zhang, Ruilin Li, Feng Zhao, Jiaqi Wang

Abstract

AI-generated summary: Temporal scheduling of credit allocation criteria in reinforcement learning with verifiable rewards improves policy evolution and learning stability by prioritizing targeted tokens and gradually shifting toward general optimization.

Reinforcement learning with verifiable rewards (RLVR) has become a core technique for post-training Large Language Models (LLMs). Policy optimization is driven by all sampled tokens under a globally broadcast scalar reward, so the heterogeneous policy behaviors exhibited along trajectories are largely left undifferentiated. Existing works address this through credit allocation, including token-level advantage reweighting and selective token optimization. However, the allocation criteria are mostly static throughout training, which limits resilient policy evolution. In this work, we argue that when learning signals are scheduled can be as important as where they are allocated across tokens, and we introduce a temporal dimension that schedules the credit allocation criteria over the course of RLVR optimization. We find that prioritizing targeted tokens associated with specific policy behaviors, and then gradually attenuating toward general optimization, leads to more stable and efficient learning dynamics. Furthermore, we show that simple trajectory percentiles provide a natural perspective for distinguishing policy behaviors and work effectively with temporal scheduling. Our analysis reveals that standard optimization substantially sacrifices policy entropy when simultaneously accommodating heterogeneous behaviors, whereas temporal scheduling yields healthier policy evolution dynamics. Experiments across mathematical and general reasoning benchmarks demonstrate consistent improvements, suggesting that temporal scheduling constitutes a promising optimization dimension.
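
The abstract does not spell out the schedule, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that a "trajectory percentile" means a token's relative position in the sampled response, that the targeted tokens are those inside a percentile band, and that the emphasis decays linearly over training. The function names, the band `[lo, hi]`, and the linear decay are all hypothetical choices made for illustration.

```python
from typing import List


def focus_weight(step: int, total_steps: int) -> float:
    """Hypothetical linear schedule: 1.0 at the first step, 0.0 at the last."""
    if total_steps <= 1:
        return 0.0
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return 1.0 - frac


def token_weights(seq_len: int, step: int, total_steps: int,
                  lo: float = 0.0, hi: float = 0.3) -> List[float]:
    """Per-token weights for one trajectory.

    Tokens whose relative position lies in the assumed percentile band
    [lo, hi] always get weight 1.0. All other tokens get weight
    (1 - focus), which grows to 1.0 as training proceeds, so the
    objective attenuates toward optimizing every token equally.
    """
    focus = focus_weight(step, total_steps)
    weights = []
    for t in range(seq_len):
        pct = t / max(seq_len - 1, 1)  # relative position in [0, 1]
        in_band = lo <= pct <= hi
        weights.append(1.0 if in_band else 1.0 - focus)
    return weights


def scheduled_advantages(advantage: float, seq_len: int, step: int,
                         total_steps: int, lo: float = 0.0,
                         hi: float = 0.3) -> List[float]:
    """Broadcast a scalar trajectory advantage, then apply the schedule."""
    weights = token_weights(seq_len, step, total_steps, lo, hi)
    return [advantage * w for w in weights]


if __name__ == "__main__":
    # Early, middle, and final steps of an 11-step run on a 5-token response.
    for step in (0, 5, 10):
        print(step, scheduled_advantages(2.0, 5, step, 11))
```

Under these assumptions, only the tokens in the band receive gradient signal at the first step, and by the final step the weighting reduces to the standard setting in which the scalar advantage is broadcast uniformly to all tokens.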

Community

Paper submitter about 6 hours ago

What if credit allocation in RLVR should evolve over training rather than remain fixed? We propose Temporal Scheduling, which gradually shifts the optimization focus across the trajectory and enables more stable policy evolution. The result is consistent gains on mathematical and general reasoning benchmarks with minimal additional complexity.

GitHub: https://github.com/Jinghaoleven/RLVR-Schedule
Hugging Face: https://huggingface.co/datasets/JingHaoZ/OpenReasoning
