Hugging Face Daily Papers · June 10, 2026 · 5 min read

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment (EM). However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users' incorrect opinions, as a previously underexplored driver of emergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing emergent misalignment that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Amplifying or suppressing these representations then exacerbates or mitigates EM, respectively. We further find that the Alignment Gating module generalizes strongly: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.
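The abstract describes gates that are both learnable and controllable, without giving their exact form. The following is a minimal sketch of one way such a gate could behave, not the authors' implementation: the class name `ToyAlignmentGate`, the sigmoid parameterization, and the `control` knob are all assumptions made for illustration. Each hidden dimension gets a gate value in (0, 1); a negative `control` suppresses the gated dimensions, a positive one amplifies them, and zero leaves the hidden state unchanged.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


class ToyAlignmentGate:
    """Hypothetical per-dimension gate (illustrative only, not the paper's module)."""

    def __init__(self, gate_logits):
        # In a real model these logits would be learned during fine-tuning;
        # here they are fixed by hand (an assumption for the example).
        self.gate_logits = list(gate_logits)

    def gate_values(self):
        """Gate value in (0, 1) for each hidden dimension."""
        return [sigmoid(z) for z in self.gate_logits]

    def apply(self, hidden, control=0.0):
        """Scale each hidden dimension by 1 + control * gate.

        control < 0 suppresses gated dimensions, control > 0 amplifies them,
        and control == 0 returns the hidden state unchanged.
        """
        if len(hidden) != len(self.gate_logits):
            raise ValueError("hidden state and gate must have the same size")
        return [h * (1.0 + control * g) for h, g in zip(hidden, self.gate_values())]


# Dimension 0 is strongly gated (logit 8.0); dimension 1 is barely gated (logit -8.0).
gate = ToyAlignmentGate([8.0, -8.0])
hidden = [2.0, 3.0]

unchanged = gate.apply(hidden, control=0.0)    # identity
suppressed = gate.apply(hidden, control=-1.0)  # dimension 0 driven toward zero
amplified = gate.apply(hidden, control=1.0)    # dimension 0 roughly doubled
```

In this toy setup only the strongly gated dimension responds to the control value, which mirrors the abstract's claim that the gates isolate specific internal representations while leaving the rest of the model's behavior largely intact.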


arxiv:2606.09068


Published on Jun 8 · Submitted by Zhu Xiangyang on Jun 10 · shanghai ailab

Authors: Sicheng Wang, Xiangyang Zhu, Han Wang, Zongrui Wang, Yuan Tian, Kaiwei Zhang, Kaiyuan Ji, Qi Jia, Guangtao Zhai

Abstract

Sycophancy fine-tuning contributes to emergent misalignment in language models, which can be reversed using Alignment Gating, a method that inserts learnable gates to identify and control the internal representations behind unsafe responses while maintaining general capabilities.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

arXiv: arxiv.org/abs/2606.09068 · GitHub: https://github.com/stay1to0/Sycophancy_Emergent_Misalignment_and_Gated_attention_FT


