Hugging Face Daily Papers · 4 min read

LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.15768


Published on Jun 14, 2026 · Submitted by jialei chen on Jun 16, 2026
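Authors: Jialei Chen, Kai Wang, Kang Chen, Shuaihang Chen, Feng Gao, Wenhao Tang, Zhiyuan Li, Weilin Liu, Zhuyu Yao, Boxun Li, Yuanbo Xu, Chao Yu

Project page: https://rlinf.github.io/LaWAM/ · Code: https://github.com/RLinf/LaWAM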

Abstract

AI-generated summary: LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of generating expensive video, achieving high performance with reduced computational latency.

Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
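The data flow described in the abstract (a latent action model trained on frozen vision-foundation-model features, its forward decoder reused as a latent world model, and an action head conditioned on the predicted latent subgoal) can be sketched as a toy NumPy program. This is not the authors' implementation (that lives at https://github.com/RLinf/LaWAM); it is a minimal sketch under loudly stated assumptions. Every module is a hypothetical linear stand-in, `encode` replaces the pretrained vision foundation model with a fixed random projection, all dimensions (`FEAT_DIM`, `LATENT_ACT_DIM`, `CHUNK`, `ACT_DIM`) are made up, and only the forward decoder is fitted (by least squares, with the inverse model frozen), which is a simplification rather than the paper's training recipe. The abstract also does not say how the latent action is obtained at inference time, so a hypothetical `W_prior` head is assumed to predict it from current features alone.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM = 64          # flattened toy "image" size (stand-in for pixels)
FEAT_DIM = 16         # hypothetical feature size of the frozen vision foundation model
LATENT_ACT_DIM = 4    # hypothetical latent-action bottleneck size
CHUNK, ACT_DIM = 8, 7 # hypothetical action-chunk length and robot action size

# Frozen "vision foundation model": a fixed random projection plus tanh (toy stand-in).
W_vfm = rng.standard_normal((OBS_DIM, FEAT_DIM)) / np.sqrt(OBS_DIM)

def encode(obs):
    """Map observations (N, OBS_DIM) to frozen features (N, FEAT_DIM)."""
    return np.tanh(obs @ W_vfm)

# Latent action model, linear for brevity: inverse encoder + forward decoder.
W_inv = rng.standard_normal((2 * FEAT_DIM, LATENT_ACT_DIM)) / np.sqrt(2 * FEAT_DIM)
W_fwd = rng.standard_normal((FEAT_DIM + LATENT_ACT_DIM, FEAT_DIM)) * 0.1

def infer_latent_action(z_t, z_next):
    """Inverse model: compress the change between two feature frames into a small latent action."""
    return np.concatenate([z_t, z_next], axis=-1) @ W_inv

def forward_decode(z_t, a_lat, W):
    """Forward decoder: predict next-frame features from current features and a latent action."""
    return np.concatenate([z_t, a_lat], axis=-1) @ W

def feature_mse(pred, target):
    """Mean squared error in feature space (no pixels are reconstructed)."""
    return float(np.mean((pred - target) ** 2))

# Toy "video" data: pairs of consecutive observations.
obs_t = rng.standard_normal((256, OBS_DIM))
obs_next = obs_t + 0.1 * rng.standard_normal((256, OBS_DIM))
z_t, z_next = encode(obs_t), encode(obs_next)
a_lat = infer_latent_action(z_t, z_next)

loss_before = feature_mse(forward_decode(z_t, a_lat, W_fwd), z_next)
# Simplification: fit only the forward decoder by least squares, inverse model frozen.
X = np.concatenate([z_t, a_lat], axis=-1)
W_fwd, *_ = np.linalg.lstsq(X, z_next, rcond=None)
loss_after = feature_mse(forward_decode(z_t, a_lat, W_fwd), z_next)

# Policy side: hypothetical heads, untrained here; only the wiring is illustrated.
W_prior = rng.standard_normal((FEAT_DIM, LATENT_ACT_DIM)) * 0.1     # latent-action prior (assumed)
W_act = rng.standard_normal((2 * FEAT_DIM, CHUNK * ACT_DIM)) * 0.1  # action head

def act(obs):
    """Predict an action chunk conditioned on a predicted latent visual subgoal."""
    z = encode(obs[None, :])
    a_hat = z @ W_prior                        # latent action predicted without seeing the future
    subgoal = forward_decode(z, a_hat, W_fwd)  # repurposed forward decoder acts as the latent world model
    chunk = np.concatenate([z, subgoal], axis=-1) @ W_act
    return chunk.reshape(CHUNK, ACT_DIM)

actions = act(obs_t[0])
print(f"decoder MSE {loss_before:.4f} -> {loss_after:.4f}; action chunk shape {actions.shape}")
```

The design point the sketch tries to make visible: at inference the policy never decodes pixels. The subgoal it conditions on is a `FEAT_DIM`-sized feature vector produced by a single decoder pass, which is the source of the efficiency the abstract argues for relative to video-generating WAMs.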

Community

Comment from jialei chen (paper author and submitter):

One motivation behind LaWAM is that existing WAMs spend substantial computation generating future pixels, while policies ultimately only need a representation of future state evolution. We therefore investigate whether latent dynamics can serve as an effective predictive signal without video generation.


