Hugging Face Daily Papers · May 29, 2026 · 6 min read

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.

Project page: https://www.youzhexie.me/papers/YoCausal/index.html

Code: https://github.com/youzhe0305/YoCausal
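
The two levels described in the abstract can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the abstract does not give the RSI or CCI formulas, so `reverse_surprise` (a relative denoising-loss gap between a clip and its time-reversed copy) and `stratified_gap` (the difference in mean score between causal and non-causal clips) are hypothetical stand-ins, and `make_toy_denoiser` replaces a real VDM with a trivial noise predictor. In the benchmark the causal labels come from a VLM; here they are supplied by hand.

```python
import numpy as np


def denoising_loss(predict_noise, video, sigma, rng):
    """Noise-prediction MSE for one random noise draw at noise level sigma."""
    noise = rng.standard_normal(video.shape)
    noisy = video + sigma * noise
    return float(np.mean((predict_noise(noisy, sigma) - noise) ** 2))


def reverse_surprise(predict_noise, video, sigma=0.5, n_draws=8, seed=0, eps=1e-12):
    """Hypothetical score in [-1, 1], NOT the paper's RSI formula.

    Positive when the time-reversed clip is harder to denoise than the
    original, i.e. the model is more "surprised" by reversed time.
    """
    rng = np.random.default_rng(seed)
    reversed_video = video[::-1]  # flip the time axis of a (T, H, W, C) clip
    fwd = np.mean([denoising_loss(predict_noise, video, sigma, rng)
                   for _ in range(n_draws)])
    rev = np.mean([denoising_loss(predict_noise, reversed_video, sigma, rng)
                   for _ in range(n_draws)])
    return float((rev - fwd) / (rev + fwd + eps))


def stratified_gap(scores, is_causal):
    """Hypothetical Level 2 proxy: mean score on causal clips minus mean
    score on non-causal clips. Labels are assumed to be given."""
    scores = np.asarray(scores, dtype=float)
    is_causal = np.asarray(is_causal, dtype=bool)
    return float(scores[is_causal].mean() - scores[~is_causal].mean())


def make_toy_denoiser(prior):
    """Toy stand-in for a VDM: assumes the clean clip equals `prior`."""
    def predict_noise(noisy, sigma):
        return (noisy - prior) / sigma
    return predict_noise


# A clip that brightens over time; the toy model has "learned" this direction.
ramp = np.linspace(0.0, 1.0, 8).reshape(8, 1, 1, 1) * np.ones((8, 4, 4, 3))
toy = make_toy_denoiser(prior=ramp)
print("reverse surprise:", reverse_surprise(toy, ramp))
print("causal vs non-causal gap:",
      stratified_gap([0.8, 0.6, 0.1, 0.3], [True, True, False, False]))
```

The stratification step is what separates the two levels in this sketch: a model with a high reverse-surprise score on every clip, causal or not, would show a gap near zero, which matches the abstract's point that perceiving the arrow of time does not by itself imply causal understanding.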

arxiv:2605.30346

Published on May 28

Submitted by Yu-Lun Liu on May 29

Alaya Studio

Authors: You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang, Yu-Lun Liu, Zhixiang Wang

Abstract

AI-generated summary: Video diffusion models exhibit arrow-of-time perception without true causal understanding, as demonstrated by a novel benchmark that measures causal cognition through reverse surprise and vision-language model analysis.