Hugging Face Daily Papers · 3 min read

The Unlearnability Phenomenon in RLVR for Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.16787

Published on May 16 · Submitted by Yulin Chen on May 21
Authors: Yulin Chen, He He, Chen Zhao

Abstract

AI-generated summary: Research reveals that in reinforcement learning with verifiable reward, certain challenging examples remain unlearnable due to fundamental representation issues, despite correct rollouts being available, and existing optimization methods cannot address this limitation.

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. With cross-example gradient analysis, we show that unlearnable examples have a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.
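The abstract describes unlearnable examples as having low gradient similarity with the rest of the examples. The paper's exact procedure is not given on this page, so the following is only a minimal sketch of one plausible reading: treat each example's gradient as a flat vector and average its cosine similarity against every other example's gradient. The function names (`cosine`, `mean_similarity_to_rest`) and the toy gradient vectors are hypothetical and are not taken from the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flat gradient vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen for this sketch: a zero gradient is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity_to_rest(grads):
    """For each example, average its cosine similarity with all other examples."""
    n = len(grads)
    if n < 2:
        raise ValueError("need at least two examples to compare")
    scores = []
    for i in range(n):
        sims = [cosine(grads[i], grads[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims))
    return scores


# Hypothetical per-example gradients (assumption: already flattened to vectors).
toy_grads = [
    [1.0, 0.0],  # example 0
    [2.0, 0.0],  # example 1: same direction as example 0
    [0.0, 1.0],  # example 2: orthogonal to both others
]

scores = mean_similarity_to_rest(toy_grads)
for idx, score in enumerate(scores):
    print(f"example {idx}: mean cosine similarity to rest = {score:.2f}")
```

In this toy setup, example 2 scores lowest because its gradient points in a direction shared by no other example, which is the kind of signal the abstract associates with unlearnable examples. With a real model, the vectors would come from per-example policy gradients, and how those are computed and aggregated is left open here.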

Community

Paper submitter about 9 hours ago

We show that a substantial fraction of hard problems remain unlearnable during RLVR of language models even when correct answers are occasionally sampled, and trace this to flawed internal representations that reward-based training cannot repair.
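To make the distinction in this comment concrete, here is a rough bookkeeping sketch that separates hard examples that were eventually learned from those that stayed unsolved even though correct rollouts were sampled. The `ExampleStats` record, the `label_example` helper, the label names, and both thresholds are assumptions made for illustration; the page does not state the criteria the authors actually use.

```python
from dataclasses import dataclass


@dataclass
class ExampleStats:
    example_id: str
    initial_pass_rate: float     # fraction of correct rollouts before training
    correct_rollouts_seen: int   # correct rollouts sampled during training
    final_pass_rate: float       # fraction of correct rollouts after training


def label_example(stats, hard_threshold=0.25, learned_threshold=0.5):
    """Assign a coarse label; both thresholds are hypothetical choices."""
    if stats.initial_pass_rate > hard_threshold:
        return "not_hard"
    if stats.final_pass_rate >= learned_threshold:
        return "learned"
    if stats.correct_rollouts_seen > 0:
        # Correct answers were sampled, yet the pass rate stayed low.
        return "unlearned_despite_correct_rollouts"
    return "never_solved"


# Hypothetical training logs for four problems.
logs = [
    ExampleStats("easy-1", 0.80, 120, 0.95),
    ExampleStats("hard-1", 0.10, 40, 0.70),
    ExampleStats("hard-2", 0.05, 12, 0.06),
    ExampleStats("hard-3", 0.00, 0, 0.00),
]

labels = {s.example_id: label_example(s) for s in logs}
for example_id, label in labels.items():
    print(example_id, "->", label)
```

The third case, `hard-2`, is the one the comment is about: the reward signal was not entirely absent, yet the pass rate did not move.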


