Hugging Face Daily Papers · 6 min read

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.

Project page: https://www.youzhexie.me/papers/YoCausal/index.html
Code: https://github.com/youzhe0305/YoCausal
arxiv:2605.30346

YoCausal: How Far is Video Generation from World Model? A Causality Perspective

Published on May 28, 2026 · Submitted by Yu-Lun Liu on May 29, 2026
Authors: You-Zhe Xie, Yu-Hsuan Li, Jie-Ying Lee, Kaipeng Zhang, Yu-Lun Liu, Zhixiang Wang

Abstract

Video diffusion models can perceive the arrow of time without truly understanding causality, as shown by a benchmark that measures causal cognition through reverse-surprise scoring and vision-language model analysis.

AI-generated summary

As video diffusion models (VDMs) advance toward world models, a key question arises: do they truly understand causality, or merely overfit to statistical temporal patterns? Existing benchmarks mostly rely on synthetic data, limiting real-world generalization due to the sim-to-real gap. We present YoCausal, a two-level benchmark inspired by the Violation of Expectation (VoE) paradigm from cognitive science. By temporally reversing real-world videos at zero cost as natural counterfactual samples, YoCausal establishes an arbitrarily extensible evaluation protocol. Level 1 introduces the Reverse Surprise Index (RSI), quantifying arrow-of-time perception via denoising loss. Level 2 introduces the Causality Cognition Index (CCI), which leverages a VLM to stratify datasets into causal and non-causal subsets, disentangling genuine causal reasoning from temporal bias. Evaluation of 13 state-of-the-art VDMs reveals that perceiving the arrow of time does not imply understanding causality, and a significant gap persists relative to human-level causal cognition.
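
The two levels described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract does not give formulas for RSI or CCI, so the function names (`surprise_gap`, `subset_gap`), the toy denoiser, the single noise level, and the hard-coded causal labels below are all assumptions made for illustration. The sketch only shows the general idea of comparing denoising loss on a clip and on its time-reversed copy, then contrasting that gap between clips labeled causal and non-causal.

```python
import numpy as np

def reverse_in_time(video):
    """Flip a (T, H, W) clip along the time axis: the zero-cost counterfactual."""
    return video[::-1].copy()

def denoising_loss(denoiser, video, sigma=0.5, seed=0):
    """MSE between the injected noise and the denoiser's noise estimate.
    A real VDM would use its own noise schedule; one fixed level is an assumption."""
    rng = np.random.default_rng(seed)  # same seed -> same noise for both directions
    noise = rng.standard_normal(video.shape)
    noisy = video + sigma * noise
    return float(np.mean((denoiser(noisy, sigma) - noise) ** 2))

def surprise_gap(denoiser, video, sigma=0.5, seed=0):
    """Hypothetical RSI-like score: loss on the reversed clip minus loss on the original."""
    forward = denoising_loss(denoiser, video, sigma, seed)
    backward = denoising_loss(denoiser, reverse_in_time(video), sigma, seed)
    return backward - forward

def subset_gap(gaps, is_causal):
    """Hypothetical CCI-like score: mean gap on causal clips minus mean gap on the rest."""
    gaps = np.asarray(gaps, dtype=float)
    is_causal = np.asarray(is_causal, dtype=bool)
    return float(gaps[is_causal].mean() - gaps[~is_causal].mean())

# Toy stand-in for a video model: it "expects" brightness to rise over time.
T, H, W = 8, 4, 4
ramp = np.linspace(0.0, 1.0, T)[:, None, None] * np.ones((T, H, W))

def toy_denoiser(noisy, sigma):
    # Treat everything that deviates from the expected ramp as noise.
    return (noisy - ramp) / sigma

brightening = ramp.copy()          # clip with a clear temporal direction
static = np.full((T, H, W), 0.5)   # clip that looks identical when reversed

gap_directional = surprise_gap(toy_denoiser, brightening)
gap_static = surprise_gap(toy_denoiser, static)

# In the benchmark a VLM assigns causal / non-causal labels; here they are hard-coded.
contrast = subset_gap([gap_directional, gap_static], [True, False])
print(gap_directional, gap_static, contrast)
```

In this toy setup a positive gap means the reversed clip is harder to denoise than the original, while a clip that is unchanged by reversal yields a gap of zero. Using the same noise sample for both directions keeps the comparison paired, so the gap reflects the clip's direction and not a different noise draw.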

Community

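This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:

- CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models (2026), https://huggingface.co/papers/2605.23699
- CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering (2026), https://huggingface.co/papers/2605.23216
- From Priors to Perception: Grounding Video-LLMs in Physical Reality (2026), https://huggingface.co/papers/2605.04515
- Proprio: Latent Self-Scoring and Inference-Time Refinement for Physically Plausible Video Generation (2026), https://huggingface.co/papers/2605.28230
- PhyWorld: Physics-Faithful World Model for Video Generation (2026), https://huggingface.co/papers/2605.19242
- Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos (2026), https://huggingface.co/papers/2605.18984
- Benchmarking Single-Factor Physical Video-to-Audio Generation (2026), https://huggingface.co/papers/2605.30339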

Get this paper in your agent:

hf papers read 2605.30346
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
