Hugging Face Daily Papers · 4 min read

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.04923

Published on Jun 3, 2026 · Submitted by Hao Zhuoyuan 郝卓远 on Jun 4, 2026
Authors: Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng, Juanzi Li, Xiaozhi Wang

Abstract

CHERRL is a controlled environment for studying reward hacking in rubric-based reinforcement learning with LLM judges, enabling detection and analysis of subtle bias exploitation patterns.

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.
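The idea of injecting a known bias into a judge and observing the resulting reward divergence can be illustrated with a toy sketch. This is not the CHERRL implementation: the keyword-matching `unbiased_judge`, the length bias, and all names and parameters below are hypothetical stand-ins chosen only to make the mechanism concrete.

```python
from typing import Callable

# A judge maps a model output to a reward in [0, 1].
Judge = Callable[[str], float]


def unbiased_judge(output: str) -> float:
    """Toy stand-in for a rubric judge (assumption: a real LaaJ is an LLM call).

    Scores the fraction of hypothetical rubric items mentioned in the output.
    """
    rubric = ("definition", "example", "limitation")
    hits = sum(1 for item in rubric if item in output.lower())
    return hits / len(rubric)


def inject_length_bias(judge: Judge, bonus: float = 0.5, min_words: int = 50) -> Judge:
    """Wrap a judge with a known, controllable bias (hypothetical: favors long outputs)."""

    def biased(output: str) -> float:
        score = judge(output)
        if len(output.split()) >= min_words:
            # The injected bias: long outputs get a bonus regardless of quality.
            score = min(1.0, score + bonus)
        return score

    return biased


def reward_divergence(output: str, biased: Judge, clean: Judge) -> float:
    """Gap between the biased and unbiased reward for the same output."""
    return biased(output) - clean(output)


biased_judge = inject_length_bias(unbiased_judge)
honest = "A definition, an example, and a limitation."
padded = "filler " * 60  # long, but satisfies no rubric item

print(reward_divergence(honest, biased_judge, unbiased_judge))
print(reward_divergence(padded, biased_judge, unbiased_judge))
```

Because the bias is injected deliberately, any gap between the two scores is attributable to it, which is what makes the divergence directly observable in a controlled setting.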

Community

Paper submitter about 2 hours ago

Rubric-based RL uses an LLM-as-a-Judge (LaaJ) to score model outputs against rubrics as rewards. Policy models can exploit latent biases in the judge, leading to reward hacking and unsafe or ineffective training. In real-world settings these hacking behaviors are subtle, entangled with multiple judge biases, and hard to analyze.

CHERRL is a controllable hacking environment for rubric-based RL. By injecting known biases into the LaaJ, CHERRL enables:

Stable reproduction of reward hacking from a clean starting point
Explicit observation of reward divergence between the biased and unbiased judges
Precise identification of the hacking onset step

To demonstrate its utility, we analyze judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system (RHDA) for automatically detecting reward hacking onset from training logs.
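To make "hacking onset step" concrete, here is a minimal threshold-based sketch. It is not RHDA, which the authors describe as agent-based. It assumes a hypothetical log format with per-step mean rewards from both the biased and the unbiased judge, and the `threshold` and `patience` values are arbitrary illustrative choices.

```python
from typing import Optional, Sequence


def detect_onset(
    biased_rewards: Sequence[float],
    clean_rewards: Sequence[float],
    threshold: float = 0.2,
    patience: int = 3,
) -> Optional[int]:
    """Return the first step of a run of `patience` consecutive steps where the
    biased reward exceeds the unbiased reward by more than `threshold`.

    Returns None if no such run exists. Both inputs are hypothetical
    per-step mean rewards read from training logs.
    """
    if len(biased_rewards) != len(clean_rewards):
        raise ValueError("reward series must have the same length")

    run = 0
    for step, (b, c) in enumerate(zip(biased_rewards, clean_rewards)):
        if b - c > threshold:
            run += 1
            if run == patience:
                # Report the step where the sustained gap began.
                return step - patience + 1
        else:
            run = 0
    return None


# Illustrative series: the biased reward keeps rising while the clean one falls.
biased = [0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
clean = [0.30, 0.34, 0.38, 0.35, 0.30, 0.25, 0.20]
print(detect_onset(biased, clean))
```

Requiring several consecutive steps above the threshold is one simple way to avoid flagging a single noisy step. In real training the unbiased reward is typically unavailable, which is why detecting onset from the logs alone is the harder problem.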


