YoCausal: How Far is Video Generation from World Model? A Causality Perspective
Abstract
AI-generated summary: Video diffusion models exhibit arrow-of-time perception without true causal understanding, as shown by a benchmark that measures causal cognition through reverse surprise and vision-language model (VLM) analysis.
As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
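The abstract describes the protocol only at a high level, so the following is a minimal sketch of the idea rather than the paper's implementation. It assumes an epsilon-prediction denoising loss, a relative-difference score between reversed and forward clips, and a mean-difference comparison between causal and non-causal subsets. The names `reverse_video`, `denoising_loss`, `reverse_surprise`, `stratified_gap`, and `make_ramp_denoiser` are hypothetical, and the exact RSI and CCI formulas are not given in this text.

```python
import numpy as np


def reverse_video(video: np.ndarray) -> np.ndarray:
    """Flip a clip of shape (T, ...) along its time axis."""
    return video[::-1].copy()


def denoising_loss(video, denoiser, noise_level, rng):
    """MSE between the injected noise and the denoiser's noise prediction.

    Assumption: an epsilon-prediction objective; real VDMs may use others.
    """
    noise = rng.standard_normal(video.shape)
    noisy = video + noise_level * noise
    return float(np.mean((denoiser(noisy) - noise) ** 2))


def reverse_surprise(video, denoiser, noise_level=0.5, seed=0, eps=1e-12):
    """Hypothetical score: relative extra loss on the time-reversed clip.

    Positive values mean the reversed clip is harder to denoise.
    This is an illustrative stand-in, not the paper's RSI definition.
    """
    # Same seed for both passes so the injected noise is identical.
    loss_fwd = denoising_loss(video, denoiser, noise_level,
                              np.random.default_rng(seed))
    loss_rev = denoising_loss(reverse_video(video), denoiser, noise_level,
                              np.random.default_rng(seed))
    return (loss_rev - loss_fwd) / (loss_fwd + eps)


def stratified_gap(scores: np.ndarray, is_causal: np.ndarray) -> float:
    """Hypothetical comparison: mean score on causal minus non-causal clips.

    `is_causal` stands in for labels that a VLM would provide.
    """
    return float(scores[is_causal].mean() - scores[~is_causal].mean())


def make_ramp_denoiser(expected_clip: np.ndarray, noise_level: float):
    """Toy denoiser that expects clips to roughly follow `expected_clip`."""
    def denoiser(noisy: np.ndarray) -> np.ndarray:
        # Imperfect clean-clip estimate (0.9x) keeps the forward loss nonzero.
        return (noisy - 0.9 * expected_clip) / noise_level
    return denoiser


# Toy clip: brightness rises linearly over 8 frames of 4x4 pixels.
ramp = np.linspace(0.0, 1.0, 8)[:, None, None] * np.ones((8, 4, 4))
score = reverse_surprise(ramp, make_ramp_denoiser(ramp, 0.5))
print(score > 0)
```

In this toy setup the denoiser has a built-in expectation that brightness rises over time, so the reversed clip yields a higher loss and a positive score. A denoiser with no temporal preference would score zero, which is the kind of distinction the arrow-of-time level is meant to surface.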
Paper: arxiv.org/abs/2605.30346