Hugging Face Daily Papers · June 9, 2026 · 3 min read

Light-WAM: Efficient World Action Models with State-Fusion Action Decoding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.08242


Published on Jun 6

Submitted by SII-L1ziang on Jun 9

Wuhan University


Authors: Ziang Li, Dongzhou Cheng, Yibin Wang, Shiyue Wang, Xiaoyang Xu, Lingxuan Weng, Juan Wang, Jiaqi Wang

Abstract

Light-WAM is a lightweight world action model for robot manipulation that uses a compact video backbone and downsampled latent space for efficient future-video supervision, combined with a StateFusionActionExpert for direct action prediction.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

World Action Models (WAMs) extend robot policy learning by incorporating future prediction as an additional training objective, encouraging the policy to encode task-relevant temporal structure in its representations. Current WAMs often rely on large-scale generative architectures that incur high training costs and inference latency, making them difficult to deploy as efficient closed-loop policies. We propose Light-WAM, a lightweight World Action Model for efficient robot manipulation. Specifically, it is built with a compact video backbone and performs future-video supervision in a downsampled latent space, reducing the cost of video co-training while retaining its benefits for representation learning. For action prediction, Light-WAM introduces the StateFusionActionExpert, which reads adapted states from multiple backbone layers, fuses them through learned-query pooling, and directly predicts action chunks in a single forward pass. This design provides an efficient interface between video backbone representations and robot actions, avoiding the need for heavy generative action experts. Experiments demonstrate that Light-WAM maintains strong performance on LIBERO and achieves usable multi-task performance on RoboTwin 2.0, while using only 0.44B trainable parameters. It also achieves 72.03ms inference latency with 4.1GiB peak GPU memory and improved training throughput.
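The abstract describes the action decoder only at a high level, so the following is a minimal sketch of what "read adapted states from multiple layers, fuse with learned-query pooling, predict an action chunk in one pass" could look like, not the authors' implementation. Everything here is an assumption made for illustration: the function name `state_fusion_decode`, the per-layer linear adapters, the attention-style pooling, the choice of one learned query per action step, and all sizes. Random weights stand in for trained parameters, and plain Python lists are used so the example has no dependencies.

```python
import math
import random

def linear(x, weight):
    """Multiply vector x (length n) by weight (n rows, m columns)."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def state_fusion_decode(layer_states, adapters, queries, action_head):
    """Fuse multi-layer states with learned queries and emit an action chunk.

    layer_states: one list of token vectors per tapped backbone layer.
    adapters:     one projection matrix per layer (hypothetical "adapters").
    queries:      learned query vectors, one per action step (assumption).
    action_head:  projection from a fused state to an action vector.
    """
    # 1. Adapt each layer's tokens to a shared width and gather them in one set.
    tokens = [linear(tok, adapter)
              for states, adapter in zip(layer_states, adapters)
              for tok in states]
    scale = math.sqrt(len(queries[0]))
    chunk = []
    for query in queries:
        # 2. Learned-query pooling: attention weights over all adapted tokens.
        scores = [sum(q * t for q, t in zip(query, tok)) / scale
                  for tok in tokens]
        weights = softmax(scores)
        fused = [sum(w * tok[d] for w, tok in zip(weights, tokens))
                 for d in range(len(query))]
        # 3. Regress one action per query directly, with no iterative sampling.
        chunk.append(linear(fused, action_head))
    return chunk

rng = random.Random(0)
NUM_LAYERS, NUM_TOKENS, BACKBONE_DIM = 3, 5, 8   # hypothetical sizes
FUSED_DIM, CHUNK_LEN, ACTION_DIM = 4, 6, 7       # hypothetical sizes

layer_states = [random_matrix(NUM_TOKENS, BACKBONE_DIM, rng)
                for _ in range(NUM_LAYERS)]
adapters = [random_matrix(BACKBONE_DIM, FUSED_DIM, rng)
            for _ in range(NUM_LAYERS)]
queries = random_matrix(CHUNK_LEN, FUSED_DIM, rng)
action_head = random_matrix(FUSED_DIM, ACTION_DIM, rng)

action_chunk = state_fusion_decode(layer_states, adapters, queries, action_head)
print(len(action_chunk), len(action_chunk[0]))  # chunk length, action dimension
```

The point of the sketch is the cost profile: the whole chunk comes out of one pass of projections and a single pooling step, in contrast to a generative action expert that would need repeated sampling or denoising steps per chunk.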

Code: https://github.com/L1ziang/Light-WAM



Get this paper in your agent:

hf papers read 2606.08242

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 1
