Hugging Face Daily Papers · 3 min read

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Accepted at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2026).

Code: https://github.com/ssophiee/entropy-jailbreak-detection
arxiv:2606.25182

Published on Jun 23 · Submitted by Michele Papucci on Jun 25

Authors: Sofiia Nikolenko, Michele Papucci, Mina Rezaei, Shireen Kudukkil Manchingal

Abstract

Jailbreak attacks expose vulnerabilities in aligned large language models, revealing that harmful intent is encoded in structured intermediate uncertainty dynamics rather than output representations.

Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it remains unclear how harmful intent is encoded within the model's internal representations. We investigate this question by analyzing token-level predictive entropy trajectories across layers of a frozen LLM using the logit lens. We find that static aggregate statistics of prompt-level entropy (e.g., mean, variance) carry little discriminative signal, whereas features capturing how entropy evolves across token positions, such as monotonic rank-based trend scores, are substantially more informative. Importantly, this signal is not uniform across model depth: it is concentrated in intermediate layers and degrades at the final layer, indicating that jailbreak-relevant structure is most pronounced in mid-network representations rather than at the output head. Across multiple models (Llama, Qwen, Gemma) and adversarial benchmarks, these entropy dynamics provide architecture-consistent separation without additional training. Together, our findings show that jailbreak behavior is reflected in structured intermediate uncertainty dynamics, clarifying both which entropy-derived features encode harmful intent and where in the network that signal is most pronounced.
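
The abstract describes two ingredients: per-token predictive entropy read off an intermediate layer with the logit lens, and a monotonic rank-based trend score over token positions. The following is a minimal sketch of that idea, not the authors' implementation. Several things are assumptions made for illustration: the function names (`token_entropy`, `logit_lens_entropy`, `trend_score`), the use of a Kendall-style pairwise sign statistic as the trend score, the random arrays standing in for real hidden states and the unembedding matrix, and the omission of the final normalization layer that the logit lens usually applies before unembedding.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) per row; logits has shape (T, V)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # shift for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_p) * log_p).sum(axis=-1)

def logit_lens_entropy(hidden, unembed):
    """Project one layer's hidden states (T, d) through unembed (d, V), then take entropy.

    Assumption: the final normalization layer is skipped to keep the sketch short.
    """
    return token_entropy(hidden @ unembed)

def trend_score(series):
    """Kendall-style trend in [-1, 1] (assumed stand-in for a rank-based trend score).

    +1 means strictly increasing over positions, -1 strictly decreasing.
    """
    x = np.asarray(series, dtype=float)
    n = len(x)
    if n < 2:
        return 0.0
    diffs = np.sign(x[None, :] - x[:, None])  # diffs[i, j] = sign(x[j] - x[i])
    s = np.triu(diffs, k=1).sum()             # keep only pairs with i < j
    return float(s / (n * (n - 1) / 2))

# Hypothetical stand-ins for a frozen model: 4 layers, 12 tokens, width 16, vocab 50.
rng = np.random.default_rng(0)
num_layers, seq_len, width, vocab = 4, 12, 16, 50
unembed = rng.normal(size=(width, vocab))
hidden_by_layer = rng.normal(size=(num_layers, seq_len, width))

# One trend score per layer; with a real model these could be compared across depth.
scores = [trend_score(logit_lens_entropy(h, unembed)) for h in hidden_by_layer]
```

A static aggregate such as `entropy.mean()` discards token order, while `trend_score` depends only on it, which is the contrast the abstract draws between the two feature families. With a real model, `hidden_by_layer` would come from the per-layer hidden states of a forward pass over the prompt.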

Community

Paper submitter about 15 hours ago

Accepted at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2026).


