What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics
Abstract
Jailbreak attacks expose vulnerabilities in aligned large language models, revealing that harmful intent is encoded in structured intermediate uncertainty dynamics rather than output representations.
Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it remains unclear how harmful intent is encoded within the model's internal representations. We investigate this question by analyzing token-level predictive entropy trajectories across layers of a frozen LLM using the logit lens. We find that static aggregate statistics of prompt-level entropy (e.g., mean, variance) carry little discriminative signal, whereas features capturing how entropy evolves across token positions, such as monotonic rank-based trend scores, are substantially more informative. Importantly, this signal is not uniform across model depth: it is concentrated in intermediate layers and degrades at the final layer, indicating that jailbreak-relevant structure is most pronounced in mid-network representations rather than at the output head. Across multiple models (Llama, Qwen, Gemma) and adversarial benchmarks, these entropy dynamics provide architecture-consistent separation without additional training. Together, our findings show that jailbreak behavior is reflected in structured intermediate uncertainty dynamics, clarifying both which entropy-derived features encode harmful intent and where in the network that signal is most pronounced.
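The contrast the abstract draws between static statistics and trend features can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the "monotonic rank-based trend score" is approximated here by the Spearman rank correlation between token position and entropy, the logit-lens projection is reduced to a single matrix product with a hypothetical `unembed` matrix (a real logit lens typically also applies the model's final normalization first), and the hidden states are random stand-ins rather than activations from an actual LLM. The function names are illustrative only.

```python
import numpy as np

def token_entropies(logits):
    """Shannon entropy (in nats) of the softmax distribution at each token.
    logits has shape (num_tokens, vocab_size)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)

def logit_lens_entropies(hidden, unembed):
    """Assumed simplification of the logit lens: project one layer's hidden
    states (num_tokens, d_model) through an unembedding matrix
    (d_model, vocab_size) and take per-token entropy."""
    return token_entropies(hidden @ unembed)

def _average_ranks(x):
    """Ranks starting at 1; tied values share their mean rank."""
    x = np.asarray(x, dtype=float)
    ranks = np.empty(len(x))
    ranks[np.argsort(x, kind="mergesort")] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def trend_score(entropies):
    """Spearman correlation between token position and entropy (an assumed
    stand-in for a rank-based trend score). Returns a value in [-1, 1];
    0.0 is returned when the trajectory is constant or too short."""
    h = np.asarray(entropies, dtype=float)
    if len(h) < 2:
        return 0.0
    pos = np.arange(len(h), dtype=float)
    pos_c = pos - pos.mean()
    rank_c = _average_ranks(h) - _average_ranks(h).mean()
    denom = np.sqrt((pos_c ** 2).sum() * (rank_c ** 2).sum())
    return 0.0 if denom == 0 else float((pos_c * rank_c).sum() / denom)

# Random stand-ins for one layer's hidden states and the unembedding matrix.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(12, 16))
unembed = rng.normal(size=(16, 50))
layer_entropy = logit_lens_entropies(hidden, unembed)

# Reordering the same entropy values leaves mean and variance untouched
# but changes the trend score, which is why only the latter sees dynamics.
rising = np.sort(layer_entropy)
falling = rising[::-1]
print(rising.mean() == falling.mean(), trend_score(rising), trend_score(falling))
```

In a layer-wise analysis of the kind the abstract describes, a score like this would be computed once per layer from that layer's hidden states, so that intermediate layers can be compared against the final one.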
Community
Accepted at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2026).
arXiv: arxiv.org/abs/2606.25182