Hugging Face Daily Papers · 3 min read

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.12403

Published on Jun 10 · Submitted by Zefu Lin on Jun 11
Authors: Zefu Lin, Rongxu Cui, Junjia Xu, Xiaojuan Jin, Wenling Li, Lue Fan, Zhaoxiang Zhang

Abstract

World Pilot enhances Vision-Language-Action models by incorporating dynamic scene evolution and trajectory priors from a World-Action Model, achieving superior performance in zero-shot out-of-distribution manipulation tasks.

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose. Project Website: https://world-pilot.github.io/
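
To make the two pathways in the abstract concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the function names `latent_steering` and `action_steering`, the additive gating, and the convex blend are all assumptions chosen for illustration, since the abstract does not say how either prior is actually fused into the policy.

```python
from typing import List

Vector = List[float]


def latent_steering(tokens: List[Vector], scene_latent: Vector, gate: float) -> List[Vector]:
    """Condition perception tokens on a scene-evolution latent.

    Assumption: a simple gated addition stands in for whatever
    conditioning mechanism the paper actually uses.
    """
    if any(len(t) != len(scene_latent) for t in tokens):
        raise ValueError("token and latent dimensions must match")
    return [[x + gate * z for x, z in zip(t, scene_latent)] for t in tokens]


def action_steering(policy_chunk: List[Vector], prior_chunk: List[Vector], weight: float) -> List[Vector]:
    """Use an anticipated trajectory as a motion prior for the action chunk.

    Assumption: a convex blend between the policy's actions and the
    prior trajectory; weight=0 ignores the prior, weight=1 follows it.
    """
    if len(policy_chunk) != len(prior_chunk):
        raise ValueError("chunks must cover the same number of steps")
    return [
        [(1.0 - weight) * a + weight * p for a, p in zip(step_a, step_p)]
        for step_a, step_p in zip(policy_chunk, prior_chunk)
    ]


if __name__ == "__main__":
    # Hypothetical 2-D tokens and a 2-step, 2-DoF action chunk.
    tokens = [[1.0, 2.0], [3.0, 4.0]]
    scene_latent = [0.5, -0.5]
    print(latent_steering(tokens, scene_latent, gate=1.0))

    policy_chunk = [[0.0, 2.0], [1.0, 1.0]]
    prior_chunk = [[2.0, 0.0], [1.0, 3.0]]
    print(action_steering(policy_chunk, prior_chunk, weight=0.5))
```

The point of the sketch is only the routing: one prior enters on the perception side, the other on the action side, and each can be switched off independently (`gate=0.0` or `weight=0.0`), which mirrors the abstract's claim that the scene-evolution prior remains useful on its own.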

Community

Paper submitter about 18 hours ago

Cool Embodied Manipulation framework


Get this paper in your agent:

hf papers read 2606.12403
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

