Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments
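Authors: Qiuyue Wang, Mingsheng Li, Jian Guan, Jinhui Ye, Sicheng Xie, Yitao Liu, Junhao Chen, Zhixuan Liang, Jie Zhang, Xintong Hu, Xuhong Huang, Pei Lin, Junyang Lin, Dayiheng Liu, Shuai Bai, Jingren Zhou, Jiazhao Zhang, Haoqi Yuan, Gengze Zhou, Hang Yin, Ye Wang, Yiyang Huang, Zixing Lei, Wujian Peng, Delin Chen, Yingming Zheng, Jingyang Fan, Xianwei Zhuang, Xin Zhou, Haoyang Li, Anzhe Chen, Tong Zhang, Xuejing Liu, Yuchong Sun, Ruizhe Chen, Zhaohai Li, Chenxu Lü, Zhibo Yang, Tao Yu, Xionghui Chen
Published: 2026-05-28
Project page: https://qwen.ai/blog?id=qwenvla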
Abstract
AI-generated summary: A unified vision-language-action model is presented that integrates diverse embodied decision-making tasks through a shared architecture and training approach, demonstrating strong performance across manipulation, navigation, and trajectory prediction with generalization across different robot platforms and environments.
Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.
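The abstract describes embodiment-aware prompt conditioning only at a high level: a robot-specific textual description tells the model which embodiment and control convention is in use. The sketch below illustrates that idea in Python under stated assumptions. The `Embodiment` class, the `build_prompt` function, the prompt wording, and the example action dimensions are all hypothetical and are not taken from the paper or from any released Qwen-VLA code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of a robot platform (not from the paper)."""
    name: str
    action_dim: int
    control_convention: str


def build_prompt(embodiment: Embodiment, instruction: str) -> str:
    """Prefix the task instruction with a robot-specific textual description.

    The template is an assumption made for illustration; the paper's actual
    prompt format is not given in the abstract.
    """
    header = (
        f"Robot: {embodiment.name}. "
        f"Action space: {embodiment.action_dim}-D, {embodiment.control_convention}."
    )
    return f"{header}\nTask: {instruction.strip()}"


# Illustrative entries only; dimensions and conventions are placeholders.
WIDOWX = Embodiment("WidowX", 7, "end-effector delta pose plus gripper")
ALOHA = Embodiment("ALOHA", 14, "absolute joint positions for two arms")

if __name__ == "__main__":
    print(build_prompt(WIDOWX, "put the spoon on the towel"))
    print(build_prompt(ALOHA, "put the spoon on the towel"))
```

The same instruction yields a different conditioning string for each platform, which is the property the abstract attributes to this mechanism: a single model can be told through text alone which action space it is expected to produce.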
Community
Qwen-VLA is a unified embodied foundation model that extends Qwen's vision-language stack to support continuous action and trajectory generation across diverse robot platforms, tasks, and environments.
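This is an automated message from the Librarian Bot. I found the following papers similar to this paper. The following papers were recommended by the Semantic Scholar API:
* MotuBrain: An Advanced World Action Model for Robot Control (2026), https://huggingface.co/papers/2604.27792
* StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing (2026), https://huggingface.co/papers/2604.05014
* FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies (2026), https://huggingface.co/papers/2605.27284
* GEM: Generative Supervision Helps Embodied Intelligence (2026), https://huggingface.co/papers/2605.28548
* PhysBrain 1.0 Technical Report (2026), https://huggingface.co/papers/2605.15298
* Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines (2026), https://huggingface.co/papers/2604.23001
* Cortex 2.0: Grounding World Models in Real-World Industrial Deployment (2026), https://huggingface.co/papers/2604.20246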
Cite arxiv.org/abs/2605.30280 in a model, dataset, or Space README.md to link it from this page.