Hugging Face Daily Papers · 5 min read

World Action Models: A Survey

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.20781

Published on Jun 18, 2026 · Submitted by qiuhong shen on Jun 23, 2026

Authors: Qiuhong Shen, Shihua Zhang, Yue Liao, Qi Li, Zhenxiong Tan, Shizun Wang, Shuicheng Yan, Xinchao Wang

Organization: National University of Singapore

GitHub: https://github.com/world-action-models/awesome-world-action-models

Abstract

World Action Models are predictive-action systems that generate future states for decision-making, with designs balancing representational richness against computational constraints.

World Action Models (WAMs) are embodied predictive-action models that make a forecast of the future available to action. Recent WAMs repurpose large video generation models, and a parallel line relies on language or vision-language backbones without a video-generation core. This rapid expansion has blurred the boundary among broad world models, video generation models, action-grounded video world models, Vision-Language-Action policies, and WAMs. This survey gives the field a common account. It first clarifies these boundaries, then organizes existing works through two complementary views. The first view asks what each method is required to generate, spanning rendered futures, latent futures, and video-generation-free action reasoning. The second view decomposes each method by predictive substrate, backbone, action coupling, and deployment regime. This anatomy supports a unified discussion of interactability, causality, persistence, physical plausibility, and generalization, followed by data, evaluation, and open challenges. Across these axes, a consistent design pattern emerges: WAMs are not simply video generators with action heads, but predictive-action methods whose design choices trade representational richness against compute, memory, latency, and action-label cost. The field is moving toward methods that generate less of the future while preserving what control requires. The survey homepage is available at https://world-action-models.github.io/.
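
The two views described in the abstract can be read as a small classification schema. The following is a minimal sketch of that reading, not the survey's own notation: the names `GenerationTarget`, `WAMProfile`, and `group_by_target` are hypothetical, the example entries are made-up placeholders rather than methods covered by the survey, and the axis values are free-form strings because the abstract does not enumerate them.

```python
from dataclasses import dataclass
from enum import Enum


class GenerationTarget(Enum):
    """First view: what a method is required to generate."""
    RENDERED_FUTURE = "rendered future"
    LATENT_FUTURE = "latent future"
    ACTION_REASONING = "video-generation-free action reasoning"


@dataclass(frozen=True)
class WAMProfile:
    """Second view: the four axes named in the abstract.

    Axis values are plain strings; every value used below is an
    illustrative placeholder, not a category taken from the survey.
    """
    name: str
    target: GenerationTarget
    predictive_substrate: str
    backbone: str
    action_coupling: str
    deployment_regime: str


def group_by_target(profiles):
    """Group method names by generation target (the first view)."""
    groups = {target: [] for target in GenerationTarget}
    for profile in profiles:
        groups[profile.target].append(profile.name)
    return groups


# Hypothetical entries, for illustration only (not methods from the survey).
profiles = [
    WAMProfile("method-a", GenerationTarget.RENDERED_FUTURE,
               "pixels", "video generation model",
               "placeholder", "placeholder"),
    WAMProfile("method-b", GenerationTarget.LATENT_FUTURE,
               "latent tokens", "video generation model",
               "placeholder", "placeholder"),
    WAMProfile("method-c", GenerationTarget.ACTION_REASONING,
               "text", "vision-language model",
               "placeholder", "placeholder"),
]

groups = group_by_target(profiles)
for target, names in groups.items():
    print(f"{target.value}: {names}")
```

Keeping the generation target as an enum and the remaining axes as separate fields mirrors the abstract's framing that the two views are complementary: the same method gets one label under the first view and a four-part description under the second.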

Get this paper in your agent:

hf papers read 2606.20781
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

