Hugging Face Daily Papers · 6 min read

Measuring the Depth of LLM Unlearning via Activation Patching

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.24614

Published on May 23, 2026 · Submitted by jaeunglee on Jun 2, 2026
Authors: Jaeung Lee, Dohyun Kim, Jaemin Jo

Abstract

A new metric called Unlearning Depth Score (UDS) is introduced to evaluate how thoroughly knowledge has been removed from large language models, addressing limitations of previous methods that could not detect hidden knowledge in internal representations.

Large language model (LLM) unlearning has emerged as a crucial post-hoc mechanism for privacy protection and AI safety, yet auditing whether target knowledge is truly erased remains challenging. Existing output-level metrics fail to detect when this knowledge remains recoverable from internal representations. Recent white-box studies reveal such residual knowledge but often rely on auxiliary training or dataset-specific adaptations, leaving no generalizable metric. To address these limitations, we propose the Unlearning Depth Score (UDS), a metric that quantifies the mechanistic depth of unlearning via activation patching. UDS first identifies layers that encode the target knowledge using a retain model baseline, then measures how much of it is erased in the unlearned model on a 0-1 scale. In a meta-evaluation across 20 metrics on 150 unlearned models spanning 8 methods, UDS achieves the highest faithfulness and robustness, confirming our causal approach as the most reliable for unlearning evaluation. Case studies further reveal that white-box metrics can disagree at the layer level and that erasure depth varies across examples. We provide guidelines for integrating UDS into existing benchmarking frameworks and streamlining the evaluation pipeline. Code and data are available at https://github.com/gnueaj/unlearning-depth-score
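
To make the two-stage description concrete, here is a minimal Python sketch of how per-layer patching results could be combined into a 0-1 score. The abstract does not state the formula, so everything below is an assumption for illustration: the function name `unlearning_depth_score`, the threshold `tau`, the inputs (per-layer drops in target log-probability after patching), and the weighted, clipped-ratio aggregation. It is not the authors' implementation; the linked repository is the reference for that.

```python
from typing import Sequence


def unlearning_depth_score(
    delta_retain: Sequence[float],
    delta_unlearned: Sequence[float],
    tau: float = 0.05,
) -> float:
    """Hypothetical aggregation, NOT the paper's exact formula.

    delta_retain[l]:    assumed drop in target log-probability when layer l
                        of the original model is patched with activations
                        from the retain model (which never learned the target).
    delta_unlearned[l]: the same drop when patching from the unlearned model.
    tau:                assumed threshold for a "knowledge-encoding" layer.
    """
    if len(delta_retain) != len(delta_unlearned):
        raise ValueError("need one value per layer for both models")

    # Stage 1: keep layers where the retain-model patch removes knowledge.
    layers = [l for l, d in enumerate(delta_retain) if d > tau]
    if not layers:
        raise ValueError("no knowledge-encoding layer found")

    # Stage 2: per-layer erasure ratio, clipped to [0, 1] and weighted by
    # how strongly the layer encodes the knowledge according to stage 1.
    total = sum(delta_retain[l] for l in layers)
    score = 0.0
    for l in layers:
        ratio = min(max(delta_unlearned[l] / delta_retain[l], 0.0), 1.0)
        score += delta_retain[l] * ratio
    return score / total


if __name__ == "__main__":
    # Toy numbers for a 4-layer model; layer 0 falls below tau and is skipped.
    retain = [0.0, 0.5, 1.0, 0.5]
    unlearned = [0.0, 0.25, 0.5, 0.25]
    print(unlearning_depth_score(retain, unlearned))
```

With these toy numbers the unlearned model reproduces half of the retain-model effect at every selected layer, so the sketch returns 0.5. Under this assumed convention, a score of 1 means patching from the unlearned model degrades the target as much as patching from the retain model (deep erasure), and 0 means the knowledge is still fully recoverable.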

Community

jaeunglee (paper author, paper submitter) · Jun 2, 2026

We present the Unlearning Depth Score (UDS), a mechanistic metric for measuring how deeply target knowledge is erased from language models after unlearning. UDS uses two-stage activation patching to identify knowledge-encoding layers and quantify how much target knowledge remains recoverable from internal representations. Through a meta-evaluation of 20 metrics across 150 unlearned models and 8 unlearning methods, we show that UDS provides a more faithful and robust signal than output-level evaluations alone.

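This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

- Unlearning What Matters: Token-Level Attribution for Precise Language Model Unlearning (2026), https://huggingface.co/papers/2605.00364
- PrivUn: Unveiling Latent Ripple Effects and Shallow Forgetting in Privacy Unlearning (2026), https://huggingface.co/papers/2604.22076
- Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter (2026), https://huggingface.co/papers/2605.11685
- MAAT: Multi-phase Adapter-Aware Targeted Unlearning (2026), https://huggingface.co/papers/2605.30514
- SHRED: Retain-Set-Free Unlearning via Self-Distillation with Logit Demotion (2026), https://huggingface.co/papers/2605.07482
- Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models (2026), https://huggingface.co/papers/2605.03547
- Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution (2026), https://huggingface.co/papers/2605.15138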



