Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
Author: Sachin Kumar
Published: 2026-05-27
arXiv: 2605.27958
Code: https://github.com/techsachinkr/llm-deception-probe-stress-test
Accepted at the GEM Workshop @ ACL 2026
Abstract
Linear probes for deception detection in large language models fail under distributional shifts despite high performance on clean data, revealing that deception is encoded through distributed sub-threshold features rather than simple linear directions.
Linear probes trained on LLM activations are increasingly proposed as deception-detection metrics, yet they report AUROC above 0.96 on clean benchmarks while collapsing under distributional shift. This paper systematically pressure-tests probe-based metrics across the Gemma 3 model family (1B-27B parameters), diagnosing why they fail rather than merely documenting that they fail. We test four hypotheses about deception encoding: (1) single linear direction, (2) multi-dimensional subspace, (3) convex conic hull, (4) entropy proxy. Our design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and distractor evaluations across 8 stylistic shifts. We find that: (a) probes achieve near-perfect AUROC (>=0.998) on clean data but collapse under stylistic shifts; style-augmented probes recover near-perfect detection (mean AUROC 0.979-0.983) on unseen styles; (b) the single-direction hypothesis is rejected (k=1 captures only 0.61-0.80 AUROC), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (max |rho|=0.454, max Delta-AUROC after residualization=0.004); and (d) deception does not form a significant linear subspace (per-domain k*=0), yet multi-dimensional probes (k>=5) recover the signal through distributed sub-threshold features. Probe fragility reflects distributional narrowness rather than an architectural limitation: style-augmented probes recover near-perfect detection at both 4B and 27B, establishing that the inverse scaling pattern is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
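The entropy-residualization test in finding (c) can be illustrated with a small sketch. The abstract does not spell out the procedure, so this is one plausible reading and not the paper's implementation: regress the probe score on an entropy covariate, subtract the linear fit, and compare AUROC before and after. All data here are synthetic stand-ins, and the names `auroc`, `residualize`, `probe_scores`, and `proxy_scores` are hypothetical. A probe that is merely an entropy proxy should fall toward chance after residualization, while a probe carrying independent signal should barely move.

```python
import random


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def residualize(scores, covariate):
    """Subtract the least-squares linear fit of `covariate` from `scores`."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_c = sum(covariate) / n
    cov = sum((c - mean_c) * (s - mean_s) for c, s in zip(covariate, scores))
    var = sum((c - mean_c) ** 2 for c in covariate)
    slope = cov / var
    return [s - mean_s - slope * (c - mean_c) for s, c in zip(scores, covariate)]


# Synthetic stand-ins (assumption): 200 honest (0) and 200 deceptive (1) examples.
rng = random.Random(0)
labels = [i % 2 for i in range(400)]

# Case 1: the probe score carries label signal; entropy is unrelated to the label.
entropy = [rng.gauss(0.0, 1.0) for _ in labels]
probe_scores = [3.0 * y + 0.3 * e + rng.gauss(0.0, 1.0) for y, e in zip(labels, entropy)]
auroc_raw = auroc(probe_scores, labels)
auroc_resid = auroc(residualize(probe_scores, entropy), labels)
delta_auroc = abs(auroc_raw - auroc_resid)

# Case 2: a pure entropy proxy, where the score is just label-correlated entropy plus noise.
label_entropy = [y + rng.gauss(0.0, 1.0) for y in labels]
proxy_scores = [e + 0.1 * rng.gauss(0.0, 1.0) for e in label_entropy]
proxy_resid = auroc(residualize(proxy_scores, label_entropy), labels)

print(f"genuine probe: raw={auroc_raw:.3f} residualized={auroc_resid:.3f} delta={delta_auroc:.3f}")
print(f"entropy proxy after residualization: {proxy_resid:.3f}")
```

In this reading, a small Delta-AUROC like the one reported in the abstract corresponds to the first case: removing the entropy component leaves detection essentially unchanged.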