Hugging Face Daily Papers · May 27, 2026 · 7 min read

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.27068

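Published on May 26, 2026 · Submitted by Ye Yuan on May 27, 2026 · AAAAA Academia Attractions

Authors: Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu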

AI-generated summary

QUACK is a multimodal social reasoning environment and evaluation framework that audits the grounding of agent language at three levels: game outcomes, behavioral trajectories, and utterance-level consistency.

Abstract

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and largely remain restricted to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.

Community

stevenyuan666 · Paper author, paper submitter · May 27, 2026

Hi everyone! This is Ye Yuan from McGill University and Mila. I am excited to share our recent paper: QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents.

Social deduction games have become a popular testbed for evaluating reasoning, deception, coordination, and belief modeling in large language models. However, most environments evaluate agents primarily through game outcomes like win rates and largely remain restricted to text-only interactions. Because of this, it is difficult to determine whether an agent's language is actually grounded in what it perceived and did, or to systematically identify reasoning failures.

To address this gap, we built QUACK, an open-source environment and evaluation framework designed to audit the grounding of agent language in multimodal social reasoning.

Here is a quick overview of how QUACK works:

Multimodal & Partially Observable: Agents navigate configurable graph-based maps, complete location-bound tasks, and communicate under hidden-role adversarial incentives. They observe rendered views at both the global and local level.
Fully Replayable Ground Truth: Every episode is serialized into structured engine-level event logs, yielding a tick-by-tick ground-truth trajectory for each agent.
Three-Tier Evaluation: We score agents at three levels: game outcomes (Tier 1), behavioral trajectories (Tier 2), and utterance-level consistency (Tier 3).
Statement Verification Pipeline: At the core of Tier 3, this pipeline reconstructs the ground-truth trajectory and checks every single discussion claim against the reconstructed world state.
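
To make the replay-and-check idea above concrete, here is a minimal sketch of how a spatial claim might be verified against a reconstructed trajectory. It is not QUACK's actual code: the `MoveEvent` and `SpatialClaim` schemas, the function names, and the three verdict labels are all assumptions made for illustration, and it presumes claims have already been parsed from free text into structured form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoveEvent:
    """One engine-level log entry (hypothetical schema, not QUACK's)."""
    tick: int
    agent: str
    room: str  # room the agent occupies from this tick onward


@dataclass(frozen=True)
class SpatialClaim:
    """A parsed discussion claim: speaker says subject was in room at tick."""
    speaker: str
    subject: str
    room: str
    tick: int


def reconstruct_trajectories(events, last_tick):
    """Expand sparse move events into a tick-by-tick room list per agent."""
    trajectories = {}
    for event in sorted(events, key=lambda e: e.tick):
        path = trajectories.setdefault(event.agent, [])
        if event.tick < len(path):
            # A later event at an already-filled tick overrides it.
            path[event.tick] = event.room
            continue
        # Hold the previous room (None before the first event) up to this tick.
        previous = path[-1] if path else None
        path.extend([previous] * (event.tick - len(path)))
        path.append(event.room)
    for path in trajectories.values():
        # Hold the final room through the end of the episode.
        path.extend([path[-1]] * (last_tick + 1 - len(path)))
    return trajectories


def verify(claim, trajectories):
    """Return 'supported', 'contradicted', or 'unverifiable' for one claim."""
    path = trajectories.get(claim.subject)
    if path is None or not 0 <= claim.tick < len(path):
        return "unverifiable"
    if path[claim.tick] is None:
        return "unverifiable"
    return "supported" if path[claim.tick] == claim.room else "contradicted"


if __name__ == "__main__":
    log = [
        MoveEvent(0, "A", "Lab"),
        MoveEvent(3, "A", "Kitchen"),
        MoveEvent(1, "B", "Hall"),
    ]
    world = reconstruct_trajectories(log, last_tick=5)
    print(verify(SpatialClaim("A", "A", "Lab", 2), world))
    print(verify(SpatialClaim("B", "A", "Lab", 4), world))
```

Keeping an explicit "unverifiable" verdict matters in this sketch because the reported rates are stated over verifiable claims only, so claims the log cannot settle should be excluded rather than counted as true or false.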

We evaluated three frontier VLMs (GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7) across 270 games in both homogeneous and cross-model adversarial settings. The audit revealed that relying on win rates masks systematic reasoning failures. Specifically:

Spatial Hallucination: Even the strongest agent hallucinates 15.1% of its verifiable spatial claims.
Unsupported Accusations: Agents make over half (53.5%) of their accusations without any grounded supporting evidence.
Deception Collapse: When playing as the impostor (Duck), models exhibit a deception rate of 22.1%, meaning roughly a fifth of their verifiable claims are outright false. Furthermore, their deception sophistication is near zero, meaning they produce easily falsifiable lies rather than subtle alibis.
Language-Action Inconsistency: Agents frequently state activities or routes that directly conflict with their logged actions.
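
For readers who want to see how rates like these could be aggregated from per-claim verdicts, the following is a rough sketch under stated assumptions. The `AuditedClaim` record, its field names, and the role labels are hypothetical. The denominators follow the wording above (verifiable spatial claims, all accusations, verifiable impostor claims), but the paper's exact metric definitions may differ, for instance in whether impostor lies are excluded from the hallucination count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditedClaim:
    """One discussion claim after verification (hypothetical record)."""
    role: str                   # "duck" (impostor) or "crew"; labels are illustrative
    kind: str                   # e.g. "spatial" or "accusation"
    verdict: str                # "supported", "contradicted", or "unverifiable"
    has_evidence: bool = False  # only meaningful for accusations


def rate(numerator, denominator):
    """Return a fraction, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def summarize(claims):
    """Aggregate per-claim verdicts into three audit rates."""
    verifiable = [c for c in claims if c.verdict != "unverifiable"]
    spatial = [c for c in verifiable if c.kind == "spatial"]
    accusations = [c for c in claims if c.kind == "accusation"]
    duck = [c for c in verifiable if c.role == "duck"]
    return {
        # Share of verifiable spatial claims that the log contradicts.
        "spatial_hallucination": rate(
            sum(c.verdict == "contradicted" for c in spatial), len(spatial)
        ),
        # Share of accusations made without grounded evidence.
        "unsupported_accusation": rate(
            sum(not c.has_evidence for c in accusations), len(accusations)
        ),
        # Share of the impostor's verifiable claims that are false.
        "deception": rate(
            sum(c.verdict == "contradicted" for c in duck), len(duck)
        ),
    }


if __name__ == "__main__":
    sample = [
        AuditedClaim("crew", "spatial", "supported"),
        AuditedClaim("crew", "spatial", "contradicted"),
        AuditedClaim("duck", "spatial", "contradicted"),
        AuditedClaim("duck", "spatial", "supported"),
        AuditedClaim("crew", "accusation", "unverifiable", has_evidence=False),
        AuditedClaim("crew", "accusation", "unverifiable", has_evidence=True),
    ]
    print(summarize(sample))
```

Returning `None` for an empty denominator, rather than 0, keeps "no claims of this kind" distinguishable from "no failures" when comparing models.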

We have released the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK and raw logs at https://huggingface.co/datasets/5a-academia-attractions/QUACK.

I would love to hear the community's thoughts on this! Happy to answer any questions about the environment design or the verification pipeline!

Get this paper in your agent:

hf papers read 2605.27068

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
