QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents
Published: 2026-05-26
Authors: Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu
AI-generated summary
A multimodal social reasoning environment and evaluation framework called QUACK is introduced to audit the grounding of agent language through three-level assessment of game outcomes, behavioral trajectories, and utterance-level consistency.
Abstract
Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.
Community
Hi everyone! This is Ye Yuan from McGill University and Mila. I am excited to share our recent paper: QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents.
Social deduction games have become a popular testbed for evaluating reasoning, deception, coordination, and belief modeling in large language models. However, most environments evaluate agents primarily through game outcomes like win rates and largely remain restricted to text-only interactions. Because of this, it is difficult to determine whether an agent's language is actually grounded in what it perceived and did, or to systematically identify reasoning failures.
To address this gap, we built QUACK, an open-source environment and evaluation framework designed to audit the grounding of agent language in multimodal social reasoning.
Here is a quick overview of how QUACK works:
- Multimodal & Partially Observable: Agents navigate configurable graph-based maps, complete location-bound tasks, and communicate under hidden-role adversarial incentives. They observe both rendered global and local views.
- Fully Replayable Ground Truth: Every episode is serialized into structured engine-level event logs, yielding a tick-by-tick ground-truth trajectory for each agent.
- Three-Tier Evaluation: We score agents at three levels: game outcomes (Tier 1), behavioral trajectories (Tier 2), and utterance-level consistency (Tier 3).
- Statement Verification Pipeline: At the core of Tier 3, this pipeline reconstructs the ground-truth trajectory and checks every single discussion claim against the reconstructed world state.
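To make the last two bullets concrete, here is a minimal sketch of checking one spatial claim against a reconstructed trajectory. It is not QUACK's actual implementation: the `MoveEvent` and `SpatialClaim` records, their field names, and the three verdict labels are all hypothetical, and extracting a structured claim from a free-text utterance is assumed to have happened upstream.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MoveEvent:
    """Hypothetical engine log record: `agent` entered `room` at `tick`."""
    tick: int
    agent: str
    room: str


@dataclass(frozen=True)
class SpatialClaim:
    """Hypothetical parsed claim: `subject` was in `room` during a tick window."""
    speaker: str
    subject: str
    room: str
    start_tick: int
    end_tick: int


def reconstruct_trajectories(
    events: List[MoveEvent], last_tick: int
) -> Dict[str, Dict[int, str]]:
    """Expand sparse move events into a tick -> room map for each agent."""
    by_agent: Dict[str, List[MoveEvent]] = {}
    for ev in sorted(events, key=lambda e: e.tick):
        by_agent.setdefault(ev.agent, []).append(ev)

    trajectories: Dict[str, Dict[int, str]] = {}
    for agent, agent_events in by_agent.items():
        rooms: Dict[int, str] = {}
        current = None
        idx = 0
        for tick in range(last_tick + 1):
            # Apply every move that has happened up to and including this tick.
            while idx < len(agent_events) and agent_events[idx].tick <= tick:
                current = agent_events[idx].room
                idx += 1
            if current is not None:
                rooms[tick] = current
        trajectories[agent] = rooms
    return trajectories


def verify_spatial_claim(
    claim: SpatialClaim, trajectories: Dict[str, Dict[int, str]]
) -> str:
    """Return 'supported', 'contradicted', or 'unverifiable' for one claim."""
    rooms = trajectories.get(claim.subject)
    if rooms is None:
        return "unverifiable"
    window = [rooms.get(t) for t in range(claim.start_tick, claim.end_tick + 1)]
    # Missing ticks mean the log cannot settle the claim either way.
    if not window or any(room is None for room in window):
        return "unverifiable"
    return "supported" if claim.room in window else "contradicted"
```

In this sketch a claim counts as supported if the subject was in the named room at any tick of the window. Stricter rules, such as requiring presence for the whole window, are an equally plausible design choice.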
We evaluated three frontier VLMs (GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7) across 270 games in both homogeneous and cross-model adversarial settings. The audit showed that win rates alone mask systematic reasoning failures. Specifically:
- Spatial Hallucination: Even the strongest agent hallucinates 15.1% of its verifiable spatial claims.
- Unsupported Accusations: Agents make over half (53.5%) of their accusations without any grounded supporting evidence.
- Deception Collapse: When playing as the impostor (Duck), models exhibit a deception rate of 22.1%, meaning roughly a fifth of their verifiable claims are outright false. Furthermore, their deception sophistication is near zero, meaning they produce easily falsifiable lies rather than subtle alibis.
- Language-Action Inconsistency: Agents frequently state activities or routes that directly conflict with their logged actions.
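As a rough illustration of how per-claim verdicts could roll up into rates like the ones above, the sketch below aggregates a list of verdict records. The `Verdict` record and the three rate definitions are assumptions made for this example and may not match the paper's exact definitions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Verdict:
    """Hypothetical per-claim result produced by a verification step."""
    kind: str          # "spatial" or "accusation"
    is_impostor: bool  # True if the speaker was playing the Duck
    status: str        # "supported", "contradicted", or "unverifiable"


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return a ratio, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def summarize(verdicts: Iterable[Verdict]) -> Dict[str, Optional[float]]:
    """Aggregate verdicts into three illustrative rates (assumed definitions)."""
    vs = list(verdicts)
    verifiable_spatial = [
        v for v in vs if v.kind == "spatial" and v.status != "unverifiable"
    ]
    accusations = [v for v in vs if v.kind == "accusation"]
    impostor_verifiable = [
        v for v in vs if v.is_impostor and v.status != "unverifiable"
    ]
    return {
        # Share of checkable spatial claims that the log contradicts.
        "spatial_hallucination_rate": rate(
            sum(v.status == "contradicted" for v in verifiable_spatial),
            len(verifiable_spatial),
        ),
        # Share of accusations with no supporting evidence in the log.
        "unsupported_accusation_rate": rate(
            sum(v.status != "supported" for v in accusations),
            len(accusations),
        ),
        # Share of an impostor's checkable claims that are false.
        "deception_rate": rate(
            sum(v.status == "contradicted" for v in impostor_verifiable),
            len(impostor_verifiable),
        ),
    }
```

Returning `None` for an empty denominator keeps a game with no verifiable claims from being reported as a 0% rate.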
We have released the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK, with the raw logs hosted at https://huggingface.co/datasets/5a-academia-attractions/QUACK.
I would love to hear the community's thoughts on this! Happy to answer any questions about the environment design or the verification pipeline!
Cite arxiv.org/abs/2605.27068 in a model README.md to link it from this page.
Cite arxiv.org/abs/2605.27068 in a Space README.md to link it from this page.