Hugging Face Daily Papers · June 11, 2026 · 3 min read

World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.12403


Published on Jun 10 · Submitted by Zefu Lin on Jun 11


Authors:

Zefu Lin, Rongxu Cui, Junjia Xu, Xiaojuan Jin, Wenling Li, Lue Fan, Zhaoxiang Zhang

Abstract

World Pilot enhances Vision-Language-Action models by incorporating dynamic scene evolution and trajectory priors from a World-Action Model, achieving superior performance in zero-shot out-of-distribution manipulation tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision-Language-Action (VLA) models inherit semantic grounding from large-scale pretraining and perform competently across in-distribution manipulation tasks. This grounding, however, is built on static image-text pairs, whereas manipulation is a continuous, contact-rich process whose dynamics this pretraining cannot capture. We present World Pilot, a VLA framework that augments the policy with priors from a World-Action Model (WAM), routed into the decision chain through two complementary pathways. Latent Steering conditions the perception layer on a scene-evolution latent, and Action Steering supplies an anticipated trajectory as a motion prior to the action generator. Together the two priors equip the VLA with an anticipated view of the scene and a trajectory-level motion hint alongside its semantic conditioning, and the scene-evolution prior remains effective even when supplied by a video-pretrained world model that has not been action-post-trained. World Pilot attains a state-of-the-art Total success rate of 84.7% on the LIBERO-Plus zero-shot OOD benchmark and the highest success rate on every real-robot setting across four manipulation tasks, with the largest margins under shifts in viewpoint, geometry, deformable state, and pose.

Project Website: https://world-pilot.github.io/

Code: https://github.com/ZefuLin/WorldPilot
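The abstract names the two pathways but does not say how either prior is injected, so the following is only a toy sketch of the data flow, not the authors' method. It assumes, purely for illustration, that Latent Steering adds a gated scene-evolution latent to each visual token and that Action Steering blends the policy's proposed action chunk with the anticipated trajectory. The function names `latent_steering` and `action_steering`, the `gate` and `weight` parameters, and all numeric values are hypothetical.

```python
from typing import List

Vector = List[float]


def latent_steering(visual_tokens: List[Vector], scene_latent: Vector,
                    gate: float = 0.5) -> List[Vector]:
    """Hypothetical perception-side conditioning.

    Adds a gated copy of the scene-evolution latent to every visual token.
    The additive form and the scalar gate are assumptions, not from the paper.
    """
    if any(len(tok) != len(scene_latent) for tok in visual_tokens):
        raise ValueError("token and latent dimensions must match")
    return [[t + gate * z for t, z in zip(tok, scene_latent)]
            for tok in visual_tokens]


def action_steering(policy_traj: List[Vector], prior_traj: List[Vector],
                    weight: float = 0.5) -> List[Vector]:
    """Hypothetical action-side motion prior.

    Pulls each step of the policy's trajectory toward the anticipated
    trajectory by a convex blend. The blend is an assumption, not from the paper.
    """
    if len(policy_traj) != len(prior_traj):
        raise ValueError("trajectories must have the same horizon")
    return [[(1.0 - weight) * a + weight * p for a, p in zip(step, prior)]
            for step, prior in zip(policy_traj, prior_traj)]


# Toy inputs standing in for encoder tokens and world-model outputs.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]   # two 2-D visual tokens
scene_latent = [0.5, -0.5]                 # scene-evolution latent
policy_traj = [[0.0, 0.0], [1.0, 1.0]]     # policy's two-step action chunk
prior_traj = [[0.0, 1.0], [1.0, 2.0]]      # anticipated trajectory

steered_tokens = latent_steering(visual_tokens, scene_latent, gate=0.5)
steered_traj = action_steering(policy_traj, prior_traj, weight=0.5)
```

The two functions are deliberately independent, mirroring the abstract's claim that the scene-evolution prior can be used on its own, for example when the world model has not been action-post-trained and no anticipated trajectory is available.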