Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
Authors: Wenlong Deng, Jiaji Huang, Kaan Ozkara, Yushu Li, Christos Thrampoulidis, Xiaoxiao Li, Youngsuk Park
Published: 2026-05-24 · arXiv: 2605.25189
Abstract
Research examines reward hacking in language models through reinforcement learning update geometry, identifying optimization drift from stable trajectories and proposing trusted-direction projection to constrain gradients and delay shortcut exploitation.
AI-generated summary
Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
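To make the projection idea concrete, here is a minimal NumPy sketch under stated assumptions. The abstract does not specify how the clean reference subspace is built or applied, so this sketch assumes one plausible reading: the subspace is spanned by the top-k left singular vectors of a single weight-update matrix from a clean run, and a later gradient is orthogonally projected onto it. The function names `trusted_left_basis` and `project_gradient`, the matrix shapes, and the rank `k` are all hypothetical, and the data is synthetic.

```python
import numpy as np


def trusted_left_basis(delta_w_clean: np.ndarray, k: int) -> np.ndarray:
    """Top-k left singular vectors of a clean update matrix (m x n) -> (m x k).

    Assumption: the "clean reference subspace" is the span of these vectors.
    """
    u, _, _ = np.linalg.svd(delta_w_clean, full_matrices=False)
    return u[:, :k]


def project_gradient(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Orthogonal projection of grad (m x n) onto span(basis): B B^T G."""
    return basis @ (basis.T @ grad)


# Synthetic stand-ins for a clean-run update and a later RL gradient.
rng = np.random.default_rng(0)
m, n, k = 32, 24, 4
delta_w_clean = rng.standard_normal((m, n))
grad = rng.standard_normal((m, n))

basis = trusted_left_basis(delta_w_clean, k)
grad_projected = project_gradient(grad, basis)

# The discarded component is orthogonal to the trusted subspace.
residual = grad - grad_projected
```

Because the projection is orthogonal, the projected gradient never has a larger Frobenius norm than the original, and applying the projection twice changes nothing. The reweighting of singular directions mentioned in the discussion is not modeled here.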
Community
the most interesting bit for me is how they carve a tiny trusted subspace from a short clean warmup and then keep rl updates anchored there. by projecting gradients into that subspace and reweighting to preserve the top singular directions, they prune the drift that fuels reward hacking while preserving true progress. i’m curious how sensitive this is to the warmup distribution and the chosen rank k, since the intrinsic trajectory could shift across tasks. the arxivlens breakdown helped me parse the method details and line up with the low-rank opt dynamics they describe. it would be nice to see this tested on broader domains beyond math reasoning to see whether the same directional stability holds under different proxy signals.
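One way to probe the rank-k question raised above is to measure how much the top-k subspace of an update moves between two checkpoints. The sketch below is an illustration, not the paper's metric: it assumes directional change can be summarized by the normalized squared Frobenius norm of the product of the two bases (the mean squared cosine of the principal angles), and it uses synthetic low-rank updates. The names `top_k_left` and `subspace_overlap` are hypothetical.

```python
import numpy as np


def top_k_left(delta_w: np.ndarray, k: int) -> np.ndarray:
    """Top-k left singular vectors of an update matrix (m x n) -> (m x k)."""
    u, _, _ = np.linalg.svd(delta_w, full_matrices=False)
    return u[:, :k]


def subspace_overlap(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared cosine of principal angles between span(a) and span(b).

    Both inputs must have orthonormal columns; the result lies in [0, 1],
    with 1 meaning identical subspaces.
    """
    k = a.shape[1]
    return float(np.linalg.norm(a.T @ b, "fro") ** 2 / k)


rng = np.random.default_rng(0)
m, n, true_rank = 32, 24, 4

# Synthetic "early" update with a clear rank-4 structure.
left = np.linalg.qr(rng.standard_normal((m, true_rank)))[0]
right = np.linalg.qr(rng.standard_normal((n, true_rank)))[0]
delta_early = left @ np.diag([10.0, 8.0, 6.0, 4.0]) @ right.T

# "Stable" run: small perturbation. "Drifted" run: unrelated directions.
delta_stable = delta_early + 0.05 * rng.standard_normal((m, n))
delta_drifted = rng.standard_normal((m, n))

# Overlap with the early subspace for several candidate ranks k.
results = {}
for k in (1, 2, 4):
    ref = top_k_left(delta_early, k)
    results[k] = (
        subspace_overlap(ref, top_k_left(delta_stable, k)),
        subspace_overlap(ref, top_k_left(delta_drifted, k)),
    )
```

In this toy setup the stable run keeps a high overlap with the early subspace while the drifted run does not. Repeating the sweep on real checkpoints, with warmups drawn from different distributions, would be one way to check the sensitivity asked about here.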