Measuring the Depth of LLM Unlearning via Activation Patching
Authors: Jaeung Lee, Dohyun Kim, Jaemin Jo
Published: 2026-05-23 (arXiv: 2605.24614)
Abstract
A new metric called Unlearning Depth Score (UDS) is introduced to evaluate how thoroughly knowledge has been removed from large language models, addressing limitations of previous methods that could not detect hidden knowledge in internal representations.
Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score
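The two-stage idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation or formula: it assumes the per-layer patching effects have already been measured, and the function name `depth_score`, the `threshold` parameter, and the weighted-ratio aggregation are all hypothetical choices made for illustration. Here `retain_drop[l]` is taken to be the drop in the target answer's log-probability when layer `l` is patched with retain-model activations, and `unlearned_drop[l]` the same quantity when patching with unlearned-model activations.

```python
from typing import Sequence


def depth_score(retain_drop: Sequence[float],
                unlearned_drop: Sequence[float],
                threshold: float = 0.05) -> float:
    """Illustrative 0-1 erasure score (hypothetical aggregation, not the paper's).

    Stage 1: layers whose retain-model patch causes a drop above `threshold`
             are treated as encoding the target knowledge.
    Stage 2: for those layers, compare the drop caused by the unlearned model's
             activations against the retain-model drop, clipped to [0, 1].
    """
    if len(retain_drop) != len(unlearned_drop):
        raise ValueError("need one value per layer from each stage")

    weighted = 0.0
    total = 0.0
    for r, u in zip(retain_drop, unlearned_drop):
        if r <= threshold:
            continue  # stage 1: layer not counted as knowledge-encoding
        ratio = min(max(u / r, 0.0), 1.0)  # stage 2: fraction erased at this layer
        weighted += r * ratio              # weight layers by how much they encode
        total += r

    if total == 0.0:
        raise ValueError("no layer passed the stage-1 threshold")
    return weighted / total


# Toy numbers only: two knowledge-encoding layers, one of them half erased.
print(depth_score([1.0, 3.0], [0.5, 0.0]))
```

Under these assumptions, a score near 1 means the unlearned model's activations look like the retain model's at every knowledge-encoding layer, and a score near 0 means the knowledge is still recoverable from them. The actual definition is in the paper and the linked repository.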
Community
We present the Unlearning Depth Score (UDS), a mechanistic metric for measuring how deeply target knowledge is erased from language models after unlearning. UDS uses two-stage activation patching to identify knowledge-encoding layers and quantify how much target knowledge remains recoverable from internal representations. Through a meta-evaluation of 20 metrics across 150 unlearned models and 8 unlearning methods, we show that UDS provides a more faithful and robust signal than output-level evaluations alone.