Hugging Face Daily Papers · 4 min read

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Accepted at the GEM Workshop @ ACL 2026
arxiv:2605.27958

Published on May 27, 2026 · Submitted by Sachin Kumar on Jun 3, 2026
Authors: Sachin Kumar

Abstract

Linear probes for deception detection in large language models fail under distributional shifts despite high performance on clean data, revealing that deception is encoded through distributed sub-threshold features rather than simple linear directions.

Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet they report AUROC exceeding 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail.

We test four hypotheses about deception encoding: (1) a single linear direction, (2) a multi-dimensional subspace, (3) a convex conic hull, and (4) an entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts.

We find that:

(a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles;
(b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven;
(c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and
(d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features.

Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
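
To make the entropy-residualization test in the abstract more concrete, here is a minimal sketch of one way such a check could be set up. It is not the paper's code: the activations are synthetic, the `entropy` scalar is a hypothetical stand-in for a per-example predictive-entropy measure, the helper names (`auroc`, `fit_probe`, `fit_residualizer`, `residualize`) are invented for illustration, and Pearson correlation is used only because the abstract does not say which correlation its rho refers to. The idea is to train a linear probe, regress the entropy scalar out of every activation dimension, retrain, and compare AUROC before and after.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U statistic); assumes no tied scores."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_probe(X, y, lr=0.2, steps=500, l2=1e-3):
    """Logistic-regression probe fitted by batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = p - y
        w -= lr * (X.T @ err / len(y) + l2 * w)
        b -= lr * err.mean()
    return w, b

def fit_residualizer(X, z):
    """Least-squares map from [1, z] to every activation dimension."""
    Z = np.column_stack([np.ones_like(z), z])
    beta, *_ = np.linalg.lstsq(Z, X, rcond=None)
    return beta

def residualize(X, z, beta):
    """Subtract the part of X that is linearly predictable from z."""
    Z = np.column_stack([np.ones_like(z), z])
    return X - Z @ beta

# Synthetic stand-in data (assumption): the label signal is spread over 8
# dimensions, and a weakly label-correlated "entropy" scalar leaks into 8 others.
rng = np.random.default_rng(0)
n, d = 2000, 32
y = rng.integers(0, 2, size=n)
entropy = 0.3 * y + rng.normal(size=n)
signal = np.zeros(d)
signal[:8] = 1.0
ent_dir = np.zeros(d)
ent_dir[8:16] = 0.5
X = rng.normal(size=(n, d)) + np.outer(y, signal) + np.outer(entropy, ent_dir)

train, test = slice(0, 1000), slice(1000, None)

# Probe on raw activations, plus its correlation with the entropy scalar.
w, b = fit_probe(X[train], y[train])
scores_raw = X[test] @ w + b
auc_raw = auroc(scores_raw, y[test])
rho = np.corrcoef(scores_raw, entropy[test])[0, 1]

# Probe on activations with the entropy component regressed out
# (residualizer fitted on the training split only).
beta = fit_residualizer(X[train], entropy[train])
Xr_train = residualize(X[train], entropy[train], beta)
Xr_test = residualize(X[test], entropy[test], beta)
w_r, b_r = fit_probe(Xr_train, y[train])
auc_resid = auroc(Xr_test @ w_r + b_r, y[test])

delta = auc_raw - auc_resid
print(f"AUROC raw={auc_raw:.3f} residualized={auc_resid:.3f} "
      f"delta={delta:.3f} corr={rho:.3f}")
```

If the probe were merely reading off entropy, AUROC should drop sharply once the entropy component is removed; a small delta, as in this toy setup, is the pattern the abstract describes when it rejects the entropy-proxy hypothesis. The numbers printed here come from synthetic data and say nothing about the Gemma 3 results.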

Get this paper in your agent:

hf papers read 2605.27958
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
