Hugging Face Daily Papers · 4 min read

AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.09811

Published on Jun 8, 2026 · Submitted by Jisong Cai on Jun 9, 2026

Authors: Jisong Cai, Long Ling, Shiwei Chu, Zhongshan Liu, Jiayue Kang, Zhixuan Liang, Wenjie Xu, Yinan Mao, Weinan Zhang, Xiaokang Yang, Ru Ying, Ran Zheng, Yao Mu

Project page: https://serene-sivy.github.io/aha-wam/

Abstract

AHA-WAM is an asynchronous world-action model that uses dual Diffusion Transformers to enable efficient long-horizon planning and real-time action execution in robotic manipulation tasks.

World-action models have emerged as a promising paradigm for robot manipulation, jointly modeling visual scene dynamics and actions to inject physical priors into policy learning. However, existing world-action models couple world prediction and action execution at the same temporal resolution, forcing the world branch to model near-term frame variations that are redundant and weakly informative. We posit that strictly binding world prediction and action execution to the same temporal rhythm may underutilize the potential of the video branch for embodied control. Therefore, we propose AHA-WAM, an Asynchronous Horizon-Adaptive World-Action Model built on a dual Diffusion Transformer (DiT) architecture that reorganizes world-action modeling around this temporal asymmetry. AHA-WAM instantiates the video DiT as a low-frequency world planner that maintains rolling key-value memory over past observations and exposes reusable layerwise latent context encoding long-horizon scene evolution, while a high-frequency action DiT executes short action chunks in closed loop by querying this context through layerwise joint attention. To support asynchronous execution, we introduce horizon-adaptive offset training and Observation-Guided Video-Context Routing (OVCR), which together let the action expert exploit long-horizon world context while remaining responsive to real-time execution state without rerunning the video DiT. Experiments on RoboTwin and real-world manipulation tasks show that AHA-WAM achieves state-of-the-art performance without any robot-data pretraining, attaining 92.80% average success on RoboTwin and 78.3% success across 4 real-world tasks, while reaching 24.17 Hz closed-loop control with a 4.59x speedup over Fast-WAM.
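To put the reported control rate in perspective, the short calculation below converts 24.17 Hz into a per-step time budget and backs out the baseline rate implied by the 4.59x figure. It assumes the speedup is measured on closed-loop control frequency, which the abstract suggests but does not state outright; the function names are illustrative only.

```python
def period_ms(rate_hz: float) -> float:
    """Time available per control step, in milliseconds, at a given rate."""
    return 1000.0 / rate_hz


def implied_baseline_hz(rate_hz: float, speedup: float) -> float:
    """Baseline rate implied by a speedup factor (assumes speedup = rate ratio)."""
    return rate_hz / speedup


# Figures quoted in the abstract.
aha_hz = 24.17
speedup = 4.59

# Assumption: the 4.59x speedup compares closed-loop control frequencies.
baseline_hz = implied_baseline_hz(aha_hz, speedup)

print(f"AHA-WAM step budget: {period_ms(aha_hz):.2f} ms")
print(f"Implied Fast-WAM rate: {baseline_hz:.2f} Hz")
print(f"Implied Fast-WAM step budget: {period_ms(baseline_hz):.2f} ms")
```

Under that assumption, the model has roughly 41 ms per control step, and the baseline would run at a little over 5 Hz. These are derived values, not numbers reported by the authors.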

Community

Paper submitter Jisong Cai, Jun 9

At a Glance:

  • Make world-action models fast enough for real-time robot control.
  • Use a slow video DiT as a reusable long-horizon world planner and a fast action DiT as a closed-loop executor.
  • Adapt cached planner context to the current observation through observation-guided video-context routing.
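The three points above can be pictured as a dual-rate control loop. The sketch below is a toy illustration, not the authors' implementation: the planner, executor, and routing step are stand-ins on plain Python lists, and every name (`slow_planner`, `route_context`, `fast_executor`, `replan_every`) is hypothetical. In particular, the softmax-weighted routing is a generic attention-style choice made for illustration; the abstract does not specify how OVCR is computed.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vector = List[float]


@dataclass
class PlannerCache:
    """Context tokens from the slow planner, reused across fast steps."""
    tokens: List[Vector]
    created_at_step: int


def slow_planner(observation: Vector, step: int, num_tokens: int = 4) -> PlannerCache:
    """Stand-in for the video DiT (toy: shifted copies of the observation)."""
    tokens = [[x + k for x in observation] for k in range(num_tokens)]
    return PlannerCache(tokens=tokens, created_at_step=step)


def route_context(cache: PlannerCache, observation: Vector) -> Vector:
    """Toy routing: softmax-weight cached tokens by similarity to the observation."""
    scores = [sum(t * o for t, o in zip(tok, observation)) for tok in cache.tokens]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(observation)
    return [sum(w * tok[d] for w, tok in zip(weights, cache.tokens)) for d in range(dim)]


def fast_executor(observation: Vector, routed: Vector) -> Vector:
    """Stand-in for the action DiT (toy: step toward the routed context)."""
    return [r - o for r, o in zip(routed, observation)]


def run_episode(observations: List[Vector], replan_every: int = 4) -> Tuple[List[Vector], int]:
    """Run the fast executor every step; refresh the planner cache only periodically."""
    cache: Optional[PlannerCache] = None
    planner_calls = 0
    actions: List[Vector] = []
    for step, obs in enumerate(observations):
        if cache is None or step - cache.created_at_step >= replan_every:
            cache = slow_planner(obs, step)  # expensive call, low frequency
            planner_calls += 1
        routed = route_context(cache, obs)  # cheap adaptation to the current observation
        actions.append(fast_executor(obs, routed))
    return actions, planner_calls


observations = [[0.1 * i, 0.0] for i in range(8)]
actions, planner_calls = run_episode(observations, replan_every=4)
print(f"{len(actions)} actions from {planner_calls} planner calls")
```

The point of the structure is that the expensive planner runs once per several control steps, while the per-step work is limited to re-weighting cached context against the latest observation and producing an action.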

Get this paper in your agent:

hf papers read 2606.09811
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
