Hugging Face Daily Papers · May 26, 2026 · 5 min read

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
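The idea of measuring directional change across dominant singular directions can be made concrete with a small sketch. The metric below (one minus the normalized overlap between the top-k left singular subspaces of two update matrices) is an illustrative assumption, not the paper's stated definition. The function names, the rank `k`, and the synthetic matrices are all hypothetical.

```python
import numpy as np

def top_k_left_singular(update, k):
    """Return the k leading left singular vectors of a 2-D update matrix."""
    u, _, _ = np.linalg.svd(update, full_matrices=False)
    return u[:, :k]

def directional_change(update_a, update_b, k):
    """One minus the normalized overlap of two top-k left singular subspaces.

    Assumed metric for illustration: 0 means identical subspaces,
    1 means mutually orthogonal subspaces.
    """
    u_a = top_k_left_singular(update_a, k)
    u_b = top_k_left_singular(update_b, k)
    overlap = np.linalg.norm(u_a.T @ u_b, ord="fro") ** 2 / k
    return 1.0 - overlap

# Synthetic stand-ins for parameter updates (hypothetical data).
rng = np.random.default_rng(0)
shared = np.linalg.qr(rng.standard_normal((64, 4)))[0]  # orthonormal 4-dim basis
stable_a = shared @ rng.standard_normal((4, 32))  # update inside the shared subspace
stable_b = shared @ rng.standard_normal((4, 32))  # another update in the same subspace
drifted = rng.standard_normal((64, 32))           # update with unrelated directions

same_subspace_change = directional_change(stable_a, stable_b, k=4)
drifted_change = directional_change(stable_a, drifted, k=4)
```

In this toy setup `same_subspace_change` is numerically zero because both updates share a column space, while `drifted_change` is substantially larger for the unrelated update.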

arxiv:2605.25189

Published on May 24 · Submitted by wenlong deng on May 26

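Authors: Wenlong Deng, Jiaji Huang, Kaan Ozkara, Yushu Li, Christos Thrampoulidis, Xiaoxiao Li, Youngsuk Park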

Abstract

Research examines reward hacking in language models through reinforcement learning update geometry, identifying optimization drift from stable trajectories and proposing trusted-direction projection to constrain gradients and delay shortcut exploitation.

AI-generated summary
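The gradient constraint named in the summary can be illustrated with a minimal sketch, assuming the clean reference subspace is spanned by the top-k left singular vectors of a parameter update from a clean run. That construction, the one-sided projection, and every name below (`trusted_basis`, `project_gradient`, `k`) are assumptions for illustration and may differ from the authors' actual procedure.

```python
import numpy as np

def trusted_basis(clean_update, k):
    """Orthonormal basis (rows x k) from the top-k left singular vectors
    of an update matrix taken from a clean run (assumed construction)."""
    u, _, _ = np.linalg.svd(clean_update, full_matrices=False)
    return u[:, :k]

def project_gradient(grad, basis):
    """Orthogonally project a gradient matrix onto the span of `basis`."""
    return basis @ (basis.T @ grad)

# Synthetic stand-ins for a clean update and a later RL gradient (hypothetical data).
rng = np.random.default_rng(0)
clean_update = rng.standard_normal((16, 8))
grad = rng.standard_normal((16, 8))

basis = trusted_basis(clean_update, k=3)
projected = project_gradient(grad, basis)
residual = grad - projected  # the component the projection discards

# Sanity checks: the residual has no component in the trusted subspace,
# and projecting twice changes nothing.
basis_leak = float(np.abs(basis.T @ residual).max())
reprojection_gap = float(np.abs(project_gradient(projected, basis) - projected).max())
```

An optimizer step would then use `projected` in place of `grad`. How the rank `k` and the clean reference run are chosen is left open here.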

Community

dwenlong

Paper submitter about 8 hours ago

avahal

about 5 hours ago

the most interesting bit for me is how they carve a tiny trusted subspace from a short clean warmup and then keep rl updates anchored there. by projecting gradients into that subspace and reweighting to preserve the top singular directions, they prune the drift that fuels reward hacking while preserving true progress. i’m curious how sensitive this is to the warmup distribution and the chosen rank k, since the intrinsic trajectory could shift across tasks. the arxivlens breakdown helped me parse the method details and line up with the low-rank opt dynamics they describe. it would be nice to see this tested on broader domains beyond math reasoning to see whether the same directional stability holds under different proxy signals.

Get this paper in your agent:

hf papers read 2605.25189

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
