Hugging Face Daily Papers · May 20, 2026 · 3 min read

When Vision Speaks for Sound

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Authors: Xiaofei Wen, Wenjie Jacky Mo, Xingyu Fu, Rui Cai, Tinghui Zhu, Wendi Li, Yanan Xie, Muhao Chen, Peng Qi

Project page: https://rakanwen.github.io/when-vision-speaks-for-sound/

Code: https://github.com/rakanWen/wvs-code

arxiv:2605.16403

Published on May 13 · Submitted by Tinghui Zhu on May 20

AI-generated summary

Video-capable multimodal large language models exhibit apparent audio understanding driven by visual cues rather than actual audio processing, necessitating intervention-based frameworks for diagnosing and improving audio-visual alignment.

Abstract

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.
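The three counterfactual audio edits named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a mono audio track represented as a plain list of samples, and the function names `shift`, `mute`, `swap`, and `make_interventions` are hypothetical. It only shows what each edit does to the audio while the video frames are left untouched.

```python
from typing import Dict, List

# Assumption: a mono audio track is a list of float samples.
# A real pipeline would read and write actual audio streams.
Waveform = List[float]


def shift(audio: Waveform, offset: int) -> Waveform:
    """Delay (offset > 0) or advance (offset < 0) the audio by `offset` samples.

    Silence is padded in so the track keeps its original length and only
    the temporal alignment with the video changes.
    """
    n = len(audio)
    if offset >= 0:
        return ([0.0] * offset + audio)[:n]
    return (audio[-offset:] + [0.0] * (-offset))[:n]


def mute(audio: Waveform) -> Waveform:
    """Replace the track with silence of the same length."""
    return [0.0] * len(audio)


def swap(audio: Waveform, other: Waveform) -> Waveform:
    """Replace the track with audio from a different clip.

    The replacement is trimmed or zero-padded to the original length.
    """
    n = len(audio)
    padding = [0.0] * max(0, n - len(other))
    return (other + padding)[:n]


def make_interventions(audio: Waveform, other: Waveform, offset: int) -> Dict[str, Waveform]:
    """Build one edited track per intervention for a single video (hypothetical helper)."""
    return {
        "shift": shift(audio, offset),  # probes temporal synchronization
        "mute": mute(audio),            # probes sound existence
        "swap": swap(audio, other),     # probes audio-visual consistency
    }
```

A model that truly listens should change its answer when the edited track is paired with the original frames; a model relying on visual shortcuts would answer as if nothing had changed.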

Community

DarthZhu

Paper submitter about 7 hours ago

WVS-Thud: an intervention-driven framework that mitigates the audio-visual Clever Hans effect by teaching video models to verify actual sounds instead of relying on visual shortcuts.

Get this paper in your agent:

hf papers read 2605.16403

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 0

Datasets citing this paper: 4

Spaces citing this paper: 0

Collections including this paper: 1
