Hugging Face Daily Papers · May 26, 2026 · 3 min read

WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2605.25874


Published on May 25

Submitted by Kaining Ying (LongCat) on May 26


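Authors:

Kaining Ying, Hengrui Hu, Siyu Ren, Jiamu Li, Fengjiao Chen, Ziwen Wang, Xuezhi Cao, Xunliang Cai, Henghui Ding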

Abstract

AI-generated summary: WBench presents a comprehensive multi-turn benchmark for evaluating interactive world models across five dimensions using 289 test cases and 1,058 interaction turns with diverse scenarios and interaction types.

Interactive world models are advancing rapidly, yet existing benchmarks cover only part of the required competencies, leaving no unified standard for systematic evaluation. To fill this gap, we introduce WBench, a comprehensive multi-turn benchmark for interactive world model evaluation along five dimensions, namely video quality, setting adherence, interaction adherence, consistency, and physics compliance. WBench contains 289 test cases and 1,058 interaction turns, where each case specifies a world setting and a multi-turn interaction sequence, covering diverse scenes, styles, subjects, and both first- and third-person perspectives, together with four interaction types, including navigation, subject action, event editing, and perspective switching. For navigation, WBench unifies text, 6-DoF pose, and discrete-action control, enabling evaluation of models with different native input interfaces. Evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments. Across 20 state-of-the-art models, we find that no single model performs strongly across all dimensions. We provide detailed diagnostic insights into the characteristic strengths, weaknesses, and open challenges of each model. Code and data are available at https://github.com/meituan-longcat/WBench.
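To make the structure described in the abstract more concrete, here is a minimal Python sketch of how a single test case and its scores might be represented. It is not the WBench data format or scoring code: the class names, field names, and the unweighted mean used to combine sub-metrics are all assumptions made for illustration. Only the five dimensions, the four interaction types, the three navigation control modes, and the case and turn counts come from the abstract.

```python
from dataclasses import dataclass, field
from enum import Enum
from statistics import mean
from typing import Dict, List, Optional


class InteractionType(Enum):
    # The four interaction types named in the abstract.
    NAVIGATION = "navigation"
    SUBJECT_ACTION = "subject_action"
    EVENT_EDITING = "event_editing"
    PERSPECTIVE_SWITCHING = "perspective_switching"


class NavigationControl(Enum):
    # The three navigation control interfaces named in the abstract.
    TEXT = "text"
    POSE_6DOF = "6dof_pose"
    DISCRETE_ACTION = "discrete_action"


# The five evaluation dimensions named in the abstract.
DIMENSIONS = (
    "video_quality",
    "setting_adherence",
    "interaction_adherence",
    "consistency",
    "physics_compliance",
)


@dataclass
class Turn:
    # Hypothetical fields: one interaction step within a case.
    interaction: InteractionType
    instruction: str
    control: Optional[NavigationControl] = None  # set only for navigation turns


@dataclass
class BenchCase:
    # Hypothetical fields: a world setting plus a multi-turn interaction sequence.
    world_setting: str
    perspective: str  # "first_person" or "third_person"
    turns: List[Turn] = field(default_factory=list)


def dimension_scores(sub_metrics: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average sub-metric scores within each dimension.

    The unweighted mean is an assumption; the abstract does not state
    how the 22 sub-metrics are combined.
    """
    return {dim: mean(scores.values()) for dim, scores in sub_metrics.items()}


def mean_turns_per_case(num_turns: int, num_cases: int) -> float:
    """Average number of interaction turns per test case."""
    return num_turns / num_cases


example = BenchCase(
    world_setting="A rainy city street at night",  # invented example setting
    perspective="first_person",
    turns=[
        Turn(InteractionType.NAVIGATION, "move forward", NavigationControl.TEXT),
        Turn(InteractionType.EVENT_EDITING, "the rain stops"),
    ],
)
```

With the counts reported in the abstract, `mean_turns_per_case(1058, 289)` works out to roughly 3.66 turns per case.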

Project page: https://meituan-longcat.github.io/WBench/ · GitHub: https://github.com/meituan-longcat/WBench



