Hugging Face Daily Papers · 4 min read

WALL-WM: Carving World Action Modeling at the Event Joints

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.01955


Published on Jun 1, 2026 · Submitted by Ruili on Jun 3, 2026
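Authors: Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang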

Abstract

WALL-WM advances video-action learning by using semantic events as learning units instead of fixed action chunks, enabling more flexible and scalable vision-language-action training and inference.

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
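The granularity mismatch described in the abstract can be illustrated with a toy example. The sketch below is not taken from the paper or its codebase: the `Event` class, the helper functions, the captions, and the step counts are all hypothetical. It only contrasts cutting a trajectory into fixed-length chunks with cutting it at event boundaries, and counts how many fixed chunks straddle a boundary between two events.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    caption: str  # event-level caption (made-up example text)
    start: int    # first control step of the event (inclusive)
    end: int      # one past the last control step (exclusive)


def fixed_chunks(num_steps: int, chunk_len: int) -> List[Tuple[int, int]]:
    """Chunk-centric view: split a trajectory into fixed-length windows."""
    return [(s, min(s + chunk_len, num_steps))
            for s in range(0, num_steps, chunk_len)]


def event_segments(events: List[Event]) -> List[Tuple[str, int]]:
    """Event-centric view: one variable-length segment per captioned event."""
    return [(e.caption, e.end - e.start) for e in events]


def chunks_crossing_boundaries(chunks: List[Tuple[int, int]],
                               events: List[Event]) -> int:
    """Count fixed chunks that contain at least one interior event boundary."""
    boundaries = {e.end for e in events[:-1]}
    return sum(any(s < b < e for b in boundaries) for s, e in chunks)


# Hypothetical 70-step trajectory annotated with three events.
events = [
    Event("reach for the cup", 0, 23),
    Event("grasp the cup", 23, 31),
    Event("place the cup on the shelf", 31, 70),
]
chunks = fixed_chunks(num_steps=70, chunk_len=16)

print(chunks)                  # fixed windows, all 16 steps except the last
print(event_segments(events))  # variable-length segments, one per event
print(chunks_crossing_boundaries(chunks, events))
```

In this toy trajectory the window covering steps 16 to 32 mixes the end of the reach, the whole grasp, and the start of the placement, which is the kind of misalignment between a fixed prediction window and semantic events that the abstract argues against.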

Community

Paper submitter about 3 hours ago

WALL-WM: Carving World Action Modeling at the Event Joints

WALL-WM uses semantically coherent action events as the atomic unit of learning. It pairs event-grounded pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling. From the same event-pretrained backbone, it supports two inference modes: event mode for variable-length execution and unified mode with Staircase Decoding for fixed-length deployment.
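The page does not say how cluster-balanced sampling is implemented. One common reading of the term, assumed here, is to pick a cluster uniformly at random and then pick an item inside it, so that rare behaviors are not drowned out by frequent ones. The function name, the cluster labels, and the 90/10 split below are illustrative assumptions, not details from the paper.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def cluster_balanced_sample(cluster_ids: Sequence[Hashable],
                            num_samples: int,
                            seed: int = 0) -> List[int]:
    """Draw item indices: uniform over clusters, then uniform within a cluster."""
    rng = random.Random(seed)
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    clusters = sorted(members, key=str)  # fixed order keeps runs reproducible
    picks = []
    for _ in range(num_samples):
        cid = rng.choice(clusters)              # every cluster equally likely
        picks.append(rng.choice(members[cid]))  # then any item in that cluster
    return picks


# Hypothetical imbalanced pool: 90 "pick" events and 10 "pour" events.
cluster_ids = ["pick"] * 90 + ["pour"] * 10
picks = cluster_balanced_sample(cluster_ids, num_samples=2000)
pour_share = sum(cluster_ids[i] == "pour" for i in picks) / len(picks)
print(f"share of 'pour' events in the sample: {pour_share:.2f}")
```

Under plain uniform sampling the "pour" events would make up about 10% of each batch; with this scheme they make up roughly half, at the cost of revisiting the same rare items more often.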


Get this paper in your agent:

hf papers read 2606.01955
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
