The Unlearnability Phenomenon in RLVR for Language Models
Authors: Yulin Chen, He He, Chen Zhao
Published: 2026-05-16 (arXiv:2605.16787)
Abstract
Research reveals that in reinforcement learning with verifiable reward, certain challenging examples remain unlearnable due to fundamental representation issues, despite correct rollouts being available, and existing optimization methods cannot address this limitation.
AI-generated summary
Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand the phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. Using cross-example gradient analysis, we show that unlearnable examples have a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.
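The abstract does not say how gradient similarity is computed. As a rough illustration only, one common way to compare an example against "the rest of the examples" is the cosine similarity between its gradient and the mean gradient of all other examples. The sketch below assumes that definition; the function names and the toy gradient vectors are hypothetical and are not taken from the paper or its repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_to_rest(grads):
    """For each example, compare its gradient with the mean gradient of the others.

    Assumption: "gradient similarity with the rest of the examples" is read here
    as leave-one-out cosine similarity. The paper may define it differently.
    """
    n = len(grads)
    if n < 2:
        raise ValueError("need at least two per-example gradients")
    dim = len(grads[0])
    total = [sum(g[j] for g in grads) for j in range(dim)]
    scores = []
    for g in grads:
        # Mean of all gradients except the current one.
        rest_mean = [(total[j] - g[j]) / (n - 1) for j in range(dim)]
        scores.append(cosine(g, rest_mean))
    return scores


# Hypothetical per-example gradients (toy numbers, not results from the paper).
toy_grads = [
    [1.0, 0.9, 0.0],
    [0.9, 1.1, 0.1],
    [1.1, 1.0, -0.1],
    [-0.2, 0.1, 1.0],  # points in a different direction from the other three
]
scores = similarity_to_rest(toy_grads)
```

In this toy setup the last example gets the lowest score because its gradient is nearly orthogonal to the direction shared by the others, which is the kind of signal the abstract associates with unlearnable examples. In a real RLVR run the vectors would be flattened policy gradients per training example, which are far larger and usually need to be projected or restricted to a subset of parameters.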
Community
We show that a substantial fraction of hard problems remain unlearnable during RLVR of language models even when correct answers are occasionally sampled, and trace this to flawed internal representations that reward-based training cannot repair.