Hugging Face Daily Papers · May 21, 2026 · 5 min read

You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

We find that RLVR weight trajectories are extremely low-rank and highly predictable: (1) the majority of RLVR gains are captured by a rank-1 approximation of the parameter deltas, and (2) the magnitude of this rank-1 projection evolves near-linearly with training steps.

arxiv:2605.21468

Published on May 20

Submitted by Zhepei Wei on May 21

Authors: Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Chengsong Huang, Jiaxin Huang, Yu Meng

Abstract

AI-generated summary: Parameter trajectories produced by reinforcement learning with verifiable rewards (RLVR) exhibit low-rank structure. This structure enables efficient extrapolation of future checkpoints with simple linear regression, matching or exceeding full RLVR performance at a fraction of the training compute.

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving reasoning in large language models (LLMs), yet the underlying geometry of the resulting parameter trajectories remains underexplored. In this work, we demonstrate that RLVR weight trajectories are extremely low-rank and highly predictable. Specifically, we find that the majority of downstream performance gains are captured by a rank-1 approximation of the parameter deltas, and that the magnitude of this projection evolves near-linearly with training steps. Motivated by this, we propose a simple and compute-efficient method, RELEX (REinforcement Learning EXtrapolation), which estimates the rank-1 subspace from a short observation window and extrapolates future checkpoints via linear regression, with no learned model required. Across three models (Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base), RELEX produces checkpoints that match or exceed RLVR performance on both in-domain and out-of-domain benchmarks while requiring as few as 15% of the steps of full RLVR training. Remarkably, RELEX is able to extrapolate far beyond the observation window at no training cost, predicting checkpoints up to 10-20× beyond the observed prefix with continued improvement (e.g., observing only the first 50 steps and extrapolating to 1000 steps). Our ablation analysis confirms the minimalist sufficiency of RELEX: neither increasing the subspace rank nor employing non-linear modeling yields further gains in extrapolation. Finally, we show that RELEX's success stems from a "denoising" effect: by projecting updates onto the rank-1 subspace, the method discards stochastic optimization noise that would otherwise degrade performance during extrapolation. Our code is available at https://github.com/weizhepei/RELEX.
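One way to write down the procedure the abstract describes is given below. This is a reader's sketch, not the paper's notation: the symbols (base weights \theta_0, observed steps t_1, ..., t_k, stacked delta matrix D, direction v_1, fit coefficients a and b) are assumptions introduced here. The choice to stack flattened deltas across checkpoints is also an assumption; the paper may instead define the rank-1 subspace per weight matrix.

```latex
\begin{aligned}
\Delta_{t_i} &= \theta_{t_i} - \theta_0, \qquad i = 1,\dots,k
  && \text{(deltas over the observed prefix)} \\
D &= \begin{bmatrix} \Delta_{t_1} & \cdots & \Delta_{t_k} \end{bmatrix}^{\top}
   \approx \sigma_1 u_1 v_1^{\top}
  && \text{(rank-1 approximation via SVD)} \\
c_{t_i} &= \langle \Delta_{t_i}, v_1 \rangle \approx a\, t_i + b
  && \text{(near-linear magnitude, least-squares fit)} \\
\hat{\theta}_T &= \theta_0 + (a\,T + b)\, v_1, \qquad T \gg t_k
  && \text{(extrapolated checkpoint)}
\end{aligned}
```

Under this reading, the abstract's example corresponds to t_k = 50 and T = 1000, a 20× extrapolation. Any component of the deltas orthogonal to v_1 is dropped, which is where the "denoising" effect would come from.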

Community

weizhepei

Paper submitter about 7 hours ago

To exploit this structure, we propose RELEX (REinforcement Learning EXtrapolation), which first estimates the rank-1 subspace from a short observation window of RLVR training and then predicts future checkpoints via linear regression, with no learned model required.

This simple method shows promise: using only 15–20% of RLVR training as the observed prefix, RELEX matches or even surpasses full RLVR on both in-domain and out-of-domain evaluations across Qwen2.5-Math-1.5B, Qwen3-4B-Base, and Qwen3-8B-Base.

Check out our artifacts for more details:
📚 Paper: https://arxiv.org/abs/2605.21468
📝 Blog: https://weizhepei.notion.site/you-only-need-minimal-rlvr-training
💻 Code: https://github.com/weizhepei/RELEX
🤗 Checkpoints: https://huggingface.co/relex-rlvr
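
The two steps described in the comment above (estimate a rank-1 direction from a short window, then extrapolate its magnitude with linear regression) can be illustrated on a toy problem. The sketch below is not the RELEX implementation from the linked repository: the function name `rank1_extrapolate`, the synthetic trajectory, and the decision to flatten all weights into a single vector are assumptions made for illustration only.

```python
import numpy as np


def rank1_extrapolate(theta0, checkpoints, steps, target_step):
    """Extrapolate weights along the top singular direction of observed deltas.

    theta0:      (d,) flattened base weights
    checkpoints: (k, d) flattened weights observed at `steps`
    steps:       (k,) training steps of the observed checkpoints
    target_step: step to predict (may lie far beyond the observed window)
    """
    deltas = checkpoints - theta0                  # (k, d) parameter deltas
    # Top right-singular vector: the dominant direction shared by all deltas.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]                              # (d,), unit norm
    coeffs = deltas @ direction                    # (k,) projection magnitudes
    # Least-squares line: coeff ~ slope * step + intercept.
    slope, intercept = np.polyfit(steps, coeffs, deg=1)
    return theta0 + (slope * target_step + intercept) * direction


# Synthetic trajectory (hypothetical): linear drift along one hidden
# direction, plus isotropic noise standing in for optimization noise.
rng = np.random.default_rng(0)
d = 200
theta0 = rng.normal(size=d)
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)


def true_weights(step):
    """Noise-free weights of the toy trajectory at a given step."""
    return theta0 + 0.05 * step * true_dir


steps = np.array([10, 20, 30, 40, 50])             # observed prefix only
checkpoints = np.stack(
    [true_weights(s) + 0.005 * rng.normal(size=d) for s in steps]
)

predicted = rank1_extrapolate(theta0, checkpoints, steps, target_step=1000)
target = true_weights(1000)
rel_err = np.linalg.norm(predicted - target) / np.linalg.norm(target - theta0)
print(f"relative error at step 1000: {rel_err:.4f}")
```

The toy trajectory is rank-1 and linear by construction, so it only shows the mechanics; whether real RLVR trajectories behave this way is the empirical claim of the paper. The sign of the singular vector is arbitrary, but the projection coefficients flip with it, so the predicted weights are unaffected.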

Get this paper in your agent:

hf papers read 2605.21468

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
