Hugging Face Daily Papers · June 16, 2026 · 4 min read

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

#model-release #multimodal #video-gen #robotics

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Comment from paper author jialei02: One motivation behind LaWAM is that existing WAMs spend substantial computation generating future pixels, while policies ultimately only need a representation of future state evolution. We therefore investigate whether latent dynamics can serve as an effective predictive signal without video generation.

arxiv:2606.15768

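Published on Jun 14 · Submitted by jialei chen on Jun 16 · RLinf

Authors: Jialei Chen, Kai Wang, Kang Chen, Shuaihang Chen, Feng Gao, Wenhao Tang, Zhiyuan Li, Weilin Liu, Zhuyu Yao, Boxun Li, Yuanbo Xu, Chao Yu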

Abstract

LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving high performance with reduced computational latency.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
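The data flow the abstract describes can be pictured with a small toy. This is only a sketch under stated assumptions, not the authors' implementation: the class and function names, the dimensions, and the random linear maps are all hypothetical, and the latent action model is assumed to follow the common inverse-encoder / forward-decoder layout. The abstract does not say how the latent action is produced at inference time, so the `propose_latent_action` head below is an illustrative guess. What the sketch does show is the key point: the policy is conditioned on a predicted future feature vector (the latent visual subgoal), and no future pixels are ever decoded.

```python
import random

# Toy stand-in for the LaWAM data flow described in the abstract.
# Every name, dimension, and linear map here is an illustrative assumption.
FEAT_DIM = 8     # size of a (hypothetical) frozen vision-encoder feature
LATENT_ACT = 2   # size of the latent action
ACTION_DIM = 3   # robot action dimension
CHUNK = 4        # number of actions per predicted chunk

rng = random.Random(0)


def rand_matrix(rows, cols):
    """Small random weight matrix standing in for a trained network."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LatentActionModel:
    """Assumed layout: an inverse encoder and a forward decoder."""

    def __init__(self):
        self.w_inv = rand_matrix(LATENT_ACT, 2 * FEAT_DIM)
        self.w_fwd = rand_matrix(FEAT_DIM, FEAT_DIM + LATENT_ACT)

    def infer_latent_action(self, feat_t, feat_next):
        # Training-time path: summarize the change between two observations.
        return matvec(self.w_inv, feat_t + feat_next)

    def forward_decode(self, feat_t, latent_action):
        # Reused as the latent world model: predict future features.
        delta = matvec(self.w_fwd, feat_t + latent_action)
        return [f + d for f, d in zip(feat_t, delta)]


class Policy:
    """Hypothetical policy with a latent-action head and an action head."""

    def __init__(self):
        self.w_prop = rand_matrix(LATENT_ACT, FEAT_DIM)
        self.w_act = rand_matrix(CHUNK * ACTION_DIM, 2 * FEAT_DIM)

    def propose_latent_action(self, feat_t):
        # Assumption: the abstract does not specify this step.
        return matvec(self.w_prop, feat_t)

    def act(self, feat_t, subgoal):
        # Condition action generation on current features plus the subgoal.
        flat = matvec(self.w_act, feat_t + subgoal)
        return [flat[i * ACTION_DIM:(i + 1) * ACTION_DIM] for i in range(CHUNK)]


def predict_action_chunk(feat_t, lam, policy):
    latent_action = policy.propose_latent_action(feat_t)
    subgoal = lam.forward_decode(feat_t, latent_action)  # features, not pixels
    return subgoal, policy.act(feat_t, subgoal)


feat_t = [rng.gauss(0.0, 1.0) for _ in range(FEAT_DIM)]
lam, policy = LatentActionModel(), Policy()
subgoal, chunk = predict_action_chunk(feat_t, lam, policy)
print(len(subgoal), len(chunk), len(chunk[0]))  # 8 4 3
```

In this toy, the subgoal has the same size as one encoder feature vector, which is the intuition behind the latency claim: predicting a compact feature is far cheaper than generating a future video frame by frame.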

arXiv: arxiv.org/abs/2606.15768 · Project page: https://rlinf.github.io/LaWAM/ · GitHub: https://github.com/RLinf/LaWAM (14 stars)

Community

jialei02 (paper author, paper submitter) · about 6 hours ago
