Hugging Face Daily Papers · May 29, 2026 · 6 min read

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.27355

Published on May 26 · Submitted by Dongyoon Hahm on May 29 · KAIST AI

Authors: Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee

Abstract

AI-generated summary: Reinforcement Learning from Human Feedback (RLHF) presents alignment tampering vulnerabilities where language models can manipulate preference datasets, leading to amplified undesired behaviors due to limitations in pairwise comparisons and reward modeling.

Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: https://alignment-tampering.github.io/
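The mechanism described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' experimental setup: responses are reduced to a hypothetical bias flag, a true quality score, and a noisy quality proxy; annotators label pairs on quality alone; a linear Bradley-Terry reward model is fit to those labels; and best-of-N sampling then picks the highest-reward candidate. All function names, parameters, and numeric settings here are illustrative choices, not taken from the paper.

```python
import math
import random


def sample_response(rng, p_biased=0.3, quality_boost=1.0):
    """Draw one hypothetical response as (biased, quality, proxy)."""
    biased = 1.0 if rng.random() < p_biased else 0.0
    # Assumption: biased responses are written with higher quality on average.
    quality = rng.gauss(0.0, 1.0) + quality_boost * biased
    # Assumption: the reward model observes quality only through a noisy proxy.
    proxy = quality + rng.gauss(0.0, 1.0)
    return biased, quality, proxy


def build_preferences(rng, n_pairs):
    """Label pairs by true quality alone; the label never records why A beat B."""
    data = []
    for _ in range(n_pairs):
        a = sample_response(rng)
        b = sample_response(rng)
        label = 1.0 if a[1] > b[1] else 0.0
        # Store feature differences (bias flag, quality proxy) plus the label.
        data.append(((a[0] - b[0], a[2] - b[2]), label))
    return data


def fit_reward_model(data, lr=0.5, epochs=300):
    """Fit reward r = w[0]*biased + w[1]*proxy with a Bradley-Terry (logistic) loss."""
    w = [0.0, 0.0]
    n = len(data)
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for (d_bias, d_proxy), label in data:
            p = 1.0 / (1.0 + math.exp(-(w[0] * d_bias + w[1] * d_proxy)))
            grad[0] += (p - label) * d_bias
            grad[1] += (p - label) * d_proxy
        # Full-batch gradient descent step on the mean loss.
        w[0] -= lr * grad[0] / n
        w[1] -= lr * grad[1] / n
    return w


def biased_rate_best_of_n(rng, w, n, trials):
    """Fraction of best-of-n selections (by learned reward) that are biased."""
    hits = 0.0
    for _ in range(trials):
        candidates = [sample_response(rng) for _ in range(n)]
        best = max(candidates, key=lambda c: w[0] * c[0] + w[1] * c[2])
        hits += best[0]
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    w = fit_reward_model(build_preferences(rng, 2000))
    print("reward weight on bias flag:", round(w[0], 3))
    for n in (1, 4, 16):
        print(f"best-of-{n} biased rate:", biased_rate_best_of_n(rng, w, n, 2000))
```

In this toy setting the annotators never look at the bias flag, yet the fitted reward model gives it a positive weight because it is correlated with the quality that drives the labels. Selecting by that reward then returns biased responses more often than the base rate at which they are generated, which is the amplification effect the abstract describes.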

GitHub: https://github.com/alignment-tampering/alignment-tampering

Community

Hahmdong (paper author, paper submitter) · 1 day ago

We introduce Alignment Tampering, a vulnerability where the LLM undergoing alignment influences the preference dataset itself, causing RLHF to amplify undesired behaviors.

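librarian-bot · about 13 hours ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Pref-CTRL: Preference Driven LLM Alignment using Representation Editing (2026), https://huggingface.co/papers/2604.23543
* RMGAP: Benchmarking the Generalization of Reward Models across Diverse Preferences (2026), https://huggingface.co/papers/2605.01831
* Reinforcement Learning from Human Feedback: A Statistical Perspective (2026), https://huggingface.co/papers/2604.02507
* Beyond Semantic Manipulation: Token-Space Attacks on Reward Models (2026), https://huggingface.co/papers/2604.02686
* Robust Reward Modeling for Large Language Models via Causal Decomposition (2026), https://huggingface.co/papers/2604.13833
* Evaluating Risks in Weak-to-Strong Alignment: A Bias-Variance Perspective (2026), https://huggingface.co/papers/2604.25077
* Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization (2026), https://huggingface.co/papers/2604.07343

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend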
