Hugging Face Daily Papers · June 25, 2026 · 3 min read

What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Accepted at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2026).

arxiv:2606.25182

Published on Jun 23 · Submitted by Michele Papucci on Jun 25

Authors:

Sofiia Nikolenko, Michele Papucci, Mina Rezaei, Shireen Kudukkil Manchingal

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): Jailbreak attacks expose vulnerabilities in aligned large language models. This paper finds that harmful intent is reflected in structured uncertainty dynamics at intermediate layers, and that the signal degrades at the final layer.

Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it remains unclear how harmful intent is encoded within the model's internal representations. We investigate this question by analyzing token-level predictive entropy trajectories across layers of a frozen LLM using the logit lens. We find that static aggregate statistics of prompt-level entropy (e.g., mean, variance) carry little discriminative signal, whereas features capturing how entropy evolves across token positions, such as monotonic rank-based trend scores, are substantially more informative. Importantly, this signal is not uniform across model depth: it is concentrated in intermediate layers and degrades at the final layer, indicating that jailbreak-relevant structure is most pronounced in mid-network representations rather than at the output head. Across multiple models (Llama, Qwen, Gemma) and adversarial benchmarks, these entropy dynamics provide architecture-consistent separation without additional training. Together, our findings show that jailbreak behavior is reflected in structured intermediate uncertainty dynamics, clarifying both which entropy-derived features encode harmful intent and where in the network that signal is most pronounced.
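To make the pipeline in the abstract concrete, here is a minimal sketch of its two ingredients: per-token predictive entropy read out through a logit-lens style projection, and a rank-based trend score over token positions. Everything in it is an illustrative assumption rather than the authors' implementation. The abstract does not say which rank-based statistic is used, so Kendall's tau against token position stands in for it. The function names (`softmax_entropy`, `logit_lens_entropies`, `kendall_trend`) are hypothetical, the tiny matrices are toy data, and the final normalization layer that a real logit lens usually applies before unembedding is omitted for brevity.

```python
import math
from typing import List, Sequence


def softmax_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def logit_lens_entropies(
    hidden: Sequence[Sequence[float]],
    unembed: Sequence[Sequence[float]],
) -> List[float]:
    """Project each token's hidden state through the unembedding matrix
    (rows = vocabulary entries) and return one entropy per token position.
    Assumption: the final normalization layer is skipped in this toy version."""
    entropies = []
    for h in hidden:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        entropies.append(softmax_entropy(logits))
    return entropies


def kendall_trend(values: Sequence[float]) -> float:
    """Kendall tau-a between token position and value, in [-1, 1].
    +1 means strictly increasing, -1 strictly decreasing."""
    n = len(values)
    if n < 2:
        return 0.0
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise change
    return s / (n * (n - 1) / 2)


# Toy data: a 3-word vocabulary with an identity unembedding, and hidden
# states at one layer that grow more confident about word 0 at each position.
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hidden = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [4.0, 0.0, 0.0]]

entropies = logit_lens_entropies(hidden, unembed)
falling = kendall_trend(entropies)        # -1.0: entropy drops at every step
rising = kendall_trend(entropies[::-1])   # 1.0: same values, reversed order
print(falling, rising)
```

The reversed sequence has the same mean and variance as the original, yet the opposite trend score. That is the distinction the abstract draws between static aggregate statistics and features that capture how entropy evolves across token positions. In a real setting the per-layer hidden states would come from a frozen model, and the score would be computed separately for each layer.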

arXiv: https://arxiv.org/abs/2606.25182 · GitHub: https://github.com/ssophiee/entropy-jailbreak-detection

Community

mpapucci

Paper submitter about 15 hours ago

Accepted at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2026).

Get this paper in your agent:

hf papers read 2606.25182

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
