LaWAM: Latent World Action Models for Efficient Dynamics-Aware Robot Policies
Authors: Jialei Chen, Kai Wang, Kang Chen, Shuaihang Chen, Feng Gao, Wenhao Tang, Zhiyuan Li, Weilin Liu, Zhuyu Yao, Boxun Li, Yuanbo Xu, Chao Yu
Published on Jun 14, 2026
Project page: https://rlinf.github.io/LaWAM/
Code: https://github.com/RLinf/LaWAM
Abstract
LaWAM enables efficient robot control by predicting compact latent visual subgoals instead of expensive video generation, achieving high performance with reduced computational latency.
Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
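To make the data flow in the abstract concrete, here is a minimal toy sketch, not the authors' implementation. It assumes random linear maps in place of trained networks, random vectors in place of features from a pretrained vision foundation model, and hypothetical names and sizes throughout (`infer_latent_action`, `forward_decoder`, `propose_latent_action`, `policy`, `FEAT_DIM`, and so on). In particular, the abstract does not say how the latent action is obtained at control time, when no future frame is available; the `propose_latent_action` head below is purely an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only to keep the toy small.
FEAT_DIM = 32        # size of one (stand-in) vision-encoder feature vector
LATENT_ACT_DIM = 8   # size of a latent action
ACTION_DIM = 7       # size of one robot action
CHUNK = 4            # number of actions in one predicted chunk


def linear(in_dim, out_dim):
    """Random weight matrix standing in for a trained network."""
    return rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))


W_inv = linear(2 * FEAT_DIM, LATENT_ACT_DIM)         # latent-action encoder
W_fwd = linear(FEAT_DIM + LATENT_ACT_DIM, FEAT_DIM)  # forward decoder
W_prior = linear(FEAT_DIM, LATENT_ACT_DIM)           # assumed proposal head
W_pi = linear(2 * FEAT_DIM, CHUNK * ACTION_DIM)      # policy head


def infer_latent_action(z_t, z_next):
    """Latent action model: summarize the change between two feature vectors."""
    return np.tanh(np.concatenate([z_t, z_next]) @ W_inv)


def forward_decoder(z_t, latent_action):
    """Predict future observation features from current features and a latent action."""
    return np.concatenate([z_t, latent_action]) @ W_fwd


def propose_latent_action(z_t):
    """ASSUMPTION: a head that proposes a latent action from current features only."""
    return np.tanh(z_t @ W_prior)


def policy(z_t, subgoal):
    """Generate an action chunk conditioned on current and predicted features."""
    return (np.concatenate([z_t, subgoal]) @ W_pi).reshape(CHUNK, ACTION_DIM)


# Training-time view: reconstruct the next features entirely in latent space.
z_t, z_next = rng.normal(size=FEAT_DIM), rng.normal(size=FEAT_DIM)
pred = forward_decoder(z_t, infer_latent_action(z_t, z_next))
loss = float(np.mean((pred - z_next) ** 2))  # feature-space MSE, no pixels involved

# Control-time view: reuse the forward decoder to produce a latent visual subgoal.
subgoal = forward_decoder(z_t, propose_latent_action(z_t))
actions = policy(z_t, subgoal)
print(subgoal.shape, actions.shape)
```

The point of the sketch is that the predicted future stays a compact feature vector (`subgoal`) that the policy consumes directly, so no future frame is ever rendered.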
Community
One motivation behind LaWAM is that existing WAMs spend substantial computation generating future pixels, while policies ultimately only need a representation of future state evolution. We therefore investigate whether latent dynamics can serve as an effective predictive signal without video generation.
Cite arxiv.org/abs/2606.15768 in a model, dataset, or Space README.md to link it from this page.