Hugging Face Daily Papers · June 18, 2026 · 4 min read

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Authors: Mingyue Cui, Linghui Shen, Xingyi Yang


arxiv:2606.18322


Published on Jun 16 · Submitted by Xingyi Yang on Jun 18 · The Hong Kong Polytechnic University


Abstract

Sparse Autoencoders' feature-level interventions may appear successful but can be circumvented through residual-space optimization that recovers original behaviors, revealing limitations in using SAE features for complete behavioral control.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. Especially in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
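The encoder-orthogonal update mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes a standard SAE encoder of the form ReLU(w · x + b), and every name in it (`W_def`, `b_def`, `project_orthogonal`, `relative_drift`) is hypothetical. The drift formula in particular is an assumed definition, since the abstract reports a "defended-feature relative drift" without defining it. The sketch covers only the single-layer case; the cross-layer setting, which the abstract says uses the feature-map Jacobian, is not shown.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(rows, tol=1e-12):
    """Gram-Schmidt over the encoder rows of the defended features."""
    basis = []
    for r in rows:
        v = list(r)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        n = math.sqrt(dot(v, v))
        if n > tol:  # skip rows that are linearly dependent on earlier ones
            basis.append([vi / n for vi in v])
    return basis

def project_orthogonal(delta, rows):
    """Remove from delta every component that lies in span(rows)."""
    out = list(delta)
    for q in orthonormal_basis(rows):
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out

def feature_acts(x, rows, biases):
    """Assumed encoder form: ReLU(w . x + b) for each defended feature."""
    return [max(0.0, dot(w, x) + b) for w, b in zip(rows, biases)]

def relative_drift(before, after, eps=1e-12):
    """Assumed definition: ||after - before|| / (||before|| + eps)."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    return diff / (math.sqrt(dot(before, before)) + eps)

# Toy 4-dimensional residual state with two defended features (made-up numbers).
W_def = [[1.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
b_def = [0.1, -0.2]
x = [0.5, 0.3, -1.0, 2.0]

raw_step = [0.7, -0.4, 0.9, 0.2]            # unconstrained update direction
step = project_orthogonal(raw_step, W_def)  # encoder-orthogonal update
x_new = [xi + si for xi, si in zip(x, step)]

before = feature_acts(x, W_def, b_def)
after = feature_acts(x_new, W_def, b_def)
print("step:", step)
print("drift:", relative_drift(before, after))
```

Because the projected step has zero inner product with each defended encoder row, the defended pre-activations are unchanged under this assumed encoder form, while the step can still move the residual state in the remaining directions.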


Community

adamdad · Paper submitter · about 5 hours ago

SAE interventions are not as reliable as they look! 🧠🔒

We show that clamping unsafe SAE features does not reliably remove bad behaviors. Even with interventions active, suppressed behaviors can still recover through alternative residual-space directions. 🧩↩️

Feature-level control ≠ behavioral safety. 🚨

Arxiv: https://arxiv.org/abs/2606.18322
Code: https://github.com/Mingyuee88/sae-post-intervention-recovery
Project Page: https://mingyuee88.github.io/sae-post-intervention-recovery/
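The claim in the comment above, that a behavior can recover while the intervention stays active, can be made concrete with a toy example. This is a hedged sketch, not the paper's method: a linear readout (`readout`, hypothetical) stands in for the model's behavior, a single feature with encoder row `w_enc` and decoder direction `d_dec` (both made up) stands in for the SAE, and the clamp is re-applied after every update step.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def feature(x, w_enc, b_enc):
    """Assumed encoder form for one SAE feature: ReLU(w . x + b)."""
    return max(0.0, dot(w_enc, x) + b_enc)

def clamp(x, w_enc, b_enc, d_dec, target=0.0):
    """Move x along the decoder direction until the feature reads `target`."""
    act = feature(x, w_enc, b_enc)
    return [xi + (target - act) * di for xi, di in zip(x, d_dec)]

def project_out(g, w):
    """Drop the component of g along the encoder row w."""
    c = dot(g, w) / dot(w, w)
    return [gi - c * wi for gi, wi in zip(g, w)]

# Toy setup (all values are illustrative assumptions).
w_enc = [1.0, 0.0, 0.0]
d_dec = [1.0, 0.0, 0.0]
b_enc = 0.0
readout = [1.0, 0.5, 0.0]   # behavior score = readout . x

x_clean = [2.0, 0.0, 1.0]
score_before = dot(readout, x_clean)

# Intervention: clamp the "unsafe" feature to zero.
x = clamp(x_clean, w_enc, b_enc, d_dec)
score_clamped = dot(readout, x)

# Recovery: ascend the behavior score along encoder-orthogonal directions,
# re-applying the clamp after every step so the intervention stays active.
step = project_out(readout, w_enc)
steps_taken = 0
while dot(readout, x) < score_before and steps_taken < 100:
    x = [xi + si for xi, si in zip(x, step)]
    x = clamp(x, w_enc, b_enc, d_dec)
    steps_taken += 1

score_recovered = dot(readout, x)
feature_after = feature(x, w_enc, b_enc)
print(score_before, score_clamped, score_recovered, feature_after, steps_taken)
```

In this toy, the clamp drives the score down, yet the score climbs back because the readout has a component that the clamped feature's encoder row does not see. The real setting optimizes through a full network rather than a linear readout, so this only illustrates the geometry of the argument.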


Get this paper in your agent:

hf papers read 2606.18322

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

