Comment from Xingyi Yang (adamdad):

SAE interventions are not as reliable as they look! 🧠🔒

We show that clamping unsafe SAE features does not reliably remove bad behaviors. Even with interventions active, suppressed behaviors can still recover through alternative residual-space directions. 🧩↩️

Feature-level control ≠ behavioral safety. 🚨

Arxiv: https://arxiv.org/abs/2606.18322
Code: https://github.com/Mingyuee88/sae-post-intervention-recovery
Project Page: https://mingyuee88.github.io/sae-post-intervention-recovery/

Authors: Mingyue Cui, Linghui Shen, Xingyi Yang
Organization: The Hong Kong Polytechnic University
Published: 2026-06-16
SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior
Abstract
Feature-level interventions on Sparse Autoencoder (SAE) features may appear successful, but residual-space optimization can circumvent them and recover the original behavior, which limits how far SAE features can be relied on for complete behavioral control.
Sparse Autoencoders (SAEs) decompose residual-stream activations into interpretable features. Recent latent-space defenses increasingly rely on these decompositions, assuming that identified "unsafe" SAE features serve as actionable handles for monitoring and intervention. In this paradigm, clamping a specific harmful feature is expected to reliably prevent model misbehavior. However, we show that this success may hide a recoverable failure mode: the clamp may block one visible route to a behavior without eliminating the behavior itself. We formulate this vulnerability as post-intervention recovery, a constrained residual-space optimization problem. Starting from the post-intervention residual state, we optimize residual perturbations to recover the pre-intervention behavior while preserving the post-intervention values of the targeted SAE features. Even under a strong threat model where the intervention remains active throughout optimization and generation, recovery remains possible. To rule out that recovery simply undoes the intervention, we use encoder-orthogonal updates for single-layer interventions and the corresponding feature-map Jacobian in the cross-layer setting. Across TPP, unlearning, IOI, and refusal steering experiments, this stress test reveals recoverable behavior despite successful feature-level intervention. Especially in the safety-critical refusal-steering setting, we achieve a 95.8% recovery rate on valid samples while keeping defended-feature relative drift to 0.131, substantially below suffix-based baselines. A recovery-path attribution analysis further localizes this recovery to the SAE reconstruction residual, the component left unexplained by the SAE. These results expose a gap between feature-level control and behavioral completeness: SAE features can support causal intervention, but controlling them does not guarantee control over the underlying behavior.
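The encoder-orthogonal update mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a toy, assuming a linear SAE encoder whose defended features have pre-activations `w · x + b`, and it uses pure Python with random vectors rather than a real model. The names `orthonormal_basis`, `project_out`, `feature_preacts`, and `relative_drift` are hypothetical, and the drift formula here (norm of the change divided by norm of the original values) is an assumed definition, since the abstract does not spell one out. The idea shown: if every residual update is projected onto the orthogonal complement of the defended features' encoder rows, those features' pre-activations cannot change, so any behavior recovered this way does not come from undoing the clamp.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(rows, eps=1e-12):
    """Gram-Schmidt over the encoder rows of the defended features."""
    basis = []
    for row in rows:
        r = list(row)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:  # skip rows that are (numerically) linearly dependent
            basis.append([ri / norm for ri in r])
    return basis


def project_out(update, basis):
    """Remove from `update` every component lying in span(basis)."""
    out = list(update)
    for q in basis:
        c = dot(out, q)
        out = [oi - c * qi for oi, qi in zip(out, q)]
    return out


def feature_preacts(x, rows, bias):
    """Pre-activations of the defended features under a linear encoder (assumption)."""
    return [dot(w, x) + b for w, b in zip(rows, bias)]


def relative_drift(before, after, eps=1e-12):
    """Assumed drift metric: ||after - before|| / ||before||."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    den = math.sqrt(sum(b * b for b in before)) + eps
    return num / den


# Toy setup: 8-dimensional residual stream, 2 defended features (all values random).
random.seed(0)
d, k = 8, 2
encoder_rows = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
encoder_bias = [random.gauss(0, 1) for _ in range(k)]
x_post = [random.gauss(0, 1) for _ in range(d)]      # stand-in for the post-intervention state
raw_update = [random.gauss(0, 1) for _ in range(d)]  # stand-in for one optimizer step

basis = orthonormal_basis(encoder_rows)
safe_update = project_out(raw_update, basis)

before = feature_preacts(x_post, encoder_rows, encoder_bias)
after_raw = feature_preacts(
    [a + b for a, b in zip(x_post, raw_update)], encoder_rows, encoder_bias
)
after_safe = feature_preacts(
    [a + b for a, b in zip(x_post, safe_update)], encoder_rows, encoder_bias
)

drift_raw = relative_drift(before, after_raw)
drift_safe = relative_drift(before, after_safe)
print("drift with raw update:      ", drift_raw)
print("drift with projected update:", drift_safe)
```

In this toy the projected update leaves the defended pre-activations unchanged up to floating-point error, while the raw update moves them. The cross-layer setting described in the abstract would replace the encoder rows with rows of the feature-map Jacobian, which this sketch does not attempt.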