Authors: Yubo Li, Ramayya Krishnan, Rema Padman. Organization: Carnegie Mellon University. Published: 2026-05-27.
The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure
Abstract
This work identifies a failure mode in reasoning models in which the chain-of-thought remains correct while the final answer flips to an incorrect one under adversarial pressure, demonstrated through controlled experiments across multiple datasets and models.
Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from first turn to last while the emitted answer flips to a wrong one. We call this unfaithful capitulation (UC) and isolate it with a 2×2 latent-versus-behavioral framework that flip-rate metrics and single-turn faithfulness probes both miss. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% under no_think, which is paired, within-model causal evidence that reasoning creates the gap. Across models the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows the answer-slot argmax is correct in 84% of UC cells; and a naive trace-anchored defense backfires. We release all trajectories, traces, and judge labels.
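To make the 2×2 latent-versus-behavioral framing concrete, here is a minimal sketch of how per-turn labels could be crossed and how a latent-correct rate at the behavioral flip could be computed. It is an illustration under assumptions, not the authors' pipeline: the `Turn` record, the function names, and every cell label other than unfaithful capitulation are hypothetical, and the two booleans are assumed to come from some upstream labeling step (such as a judge) that is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Turn:
    # Latent axis: does the reasoning trace still support the correct answer?
    trace_correct: bool
    # Behavioral axis: is the emitted answer correct?
    answer_correct: bool


def cell(turn: Turn) -> str:
    """Map one turn to a cell of the 2x2 grid.

    Only "unfaithful_capitulation" corresponds to a term in the abstract;
    the other three labels are placeholder names for this sketch.
    """
    if turn.answer_correct:
        return "held" if turn.trace_correct else "answer_right_trace_wrong"
    return "unfaithful_capitulation" if turn.trace_correct else "full_capitulation"


def first_flip(trajectory: Sequence[Turn]) -> Optional[Turn]:
    """Return the first turn with a wrong answer, given a correct opening answer."""
    if not trajectory or not trajectory[0].answer_correct:
        return None  # no correct starting point, so no flip to measure
    for turn in trajectory[1:]:
        if not turn.answer_correct:
            return turn
    return None


def latent_correct_rate_at_flip(trajectories: Sequence[Sequence[Turn]]) -> Optional[float]:
    """Share of behavioral flips where the trace was still correct."""
    flips: List[Turn] = [t for t in map(first_flip, trajectories) if t is not None]
    if not flips:
        return None
    return sum(t.trace_correct for t in flips) / len(flips)


# Toy trajectories (fabricated for illustration only, not data from the paper).
toy = [
    [Turn(True, True), Turn(True, False)],   # trace holds, answer folds
    [Turn(True, True), Turn(False, False)],  # trace and answer both fold
    [Turn(True, True), Turn(True, True)],    # no flip under pressure
]
rate = latent_correct_rate_at_flip(toy)
```

On the toy input, two of the three trajectories flip and one of those flips keeps a correct trace, so `rate` is 0.5. A flip-rate metric alone would report two flips and could not tell the two cases apart, which is the distinction the 2×2 grid is meant to expose.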