Light-WAM: Efficient World Action Models with State-Fusion Action Decoding
Authors: Ziang Li, Dongzhou Cheng, Yibin Wang, Shiyue Wang, Xiaoyang Xu, Lingxuan Weng, Juan Wang, Jiaqi Wang
Organization: Wuhan University
Published: 2026-06-06
arXiv: 2606.08242
Code: https://github.com/L1ziang/Light-WAM
Abstract
Light-WAM is a lightweight world action model for robot manipulation that uses a compact video backbone and downsampled latent space for efficient future-video supervision, combined with a StateFusionActionExpert for direct action prediction.
World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. It is built on a compact video backbone and applies future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03 ms inference latency with 4.1 GiB peak GPU memory and improved training throughput.
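The action-decoding path described in the abstract (adapt states from several backbone layers, fuse them with learned-query pooling, emit an action chunk in one forward pass) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the adapters, the number of queries, or the output head, so the per-layer linear adapters, the single pooling query, the linear head, the random weights standing in for trained parameters, and every name below (StateFusionSketch, learned_query_pool, and so on) are assumptions made purely for illustration.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length d_in) by weight matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rng, rows, cols, scale=0.1):
    """Random weights standing in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def learned_query_pool(tokens, query):
    """Attention pooling: a single query attends over all tokens."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    weights = softmax(scores)
    pooled = [sum(w * tok[k] for w, tok in zip(weights, tokens)) for k in range(d)]
    return pooled, weights


class StateFusionSketch:
    """Hypothetical stand-in for a state-fusion action decoder."""

    def __init__(self, num_layers, d_backbone, d_fused, chunk_len, action_dim, seed=0):
        rng = random.Random(seed)
        # Assumption: one plain linear adapter per tapped backbone layer.
        self.adapters = [rand_matrix(rng, d_backbone, d_fused) for _ in range(num_layers)]
        # Assumption: a single pooling query (learned in a real model).
        self.query = [rng.gauss(0.0, 1.0) for _ in range(d_fused)]
        # Assumption: a linear head that emits the whole chunk at once.
        self.head = rand_matrix(rng, d_fused, chunk_len * action_dim)
        self.chunk_len = chunk_len
        self.action_dim = action_dim

    def predict(self, layer_states):
        """layer_states[l] is a list of token vectors from tapped layer l."""
        adapted = [
            linear(tok, w)
            for tokens, w in zip(layer_states, self.adapters)
            for tok in tokens
        ]
        pooled, weights = learned_query_pool(adapted, self.query)
        flat = linear(pooled, self.head)
        # Reshape the flat output into chunk_len actions of size action_dim.
        chunk = [
            flat[t * self.action_dim:(t + 1) * self.action_dim]
            for t in range(self.chunk_len)
        ]
        return chunk, weights


# Usage: 3 tapped layers, 5 tokens per layer, 8-dim backbone states.
rng = random.Random(1)
states = [[[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)] for _ in range(3)]
expert = StateFusionSketch(num_layers=3, d_backbone=8, d_fused=4, chunk_len=6, action_dim=7)
chunk, weights = expert.predict(states)
print(len(chunk), len(chunk[0]))  # prints "6 7"
```

The point of the sketch is the shape of the computation: there is no iterative sampling loop, so producing a full action chunk costs one pass through the adapters, one pooling step, and one head projection. A generative action expert would typically need multiple denoising or decoding steps for the same output.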