<a href=\"https://cdn-uploads.huggingface.co/production/uploads/65ec01fd770aa0e25d9374dc/V5PW_iwLCeCDL0_LQIbFL.png\" rel=\"nofollow\"><img src=\"https://cdn-uploads.huggingface.co/production/uploads/65ec01fd770aa0e25d9374dc/V5PW_iwLCeCDL0_LQIbFL.png\" alt=\"Fig2_preview\"></a></p>\n","updatedAt":"2026-05-15T03:36:22.674Z","author":{"_id":"65ec01fd770aa0e25d9374dc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/yvLWwBEdAdHb-8EdUHg3n.jpeg","fullname":"Shijie Lian","name":"LiamLian0727","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":11,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.38913339376449585},"editors":["LiamLian0727"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/yvLWwBEdAdHb-8EdUHg3n.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2605.14712","authors":[{"_id":"6a069429b1a8cbabc9f0997b","name":"Shijie Lian","hidden":false},{"_id":"6a069429b1a8cbabc9f0997c","name":"Bin Yu","hidden":false},{"_id":"6a069429b1a8cbabc9f0997d","name":"Xiaopeng Lin","hidden":false},{"_id":"6a069429b1a8cbabc9f0997e","name":"Zhaolong Shen","hidden":false},{"_id":"6a069429b1a8cbabc9f0997f","name":"Laurence Tianruo Yang","hidden":false},{"_id":"6a069429b1a8cbabc9f09980","name":"Yurun Jin","hidden":false},{"_id":"6a069429b1a8cbabc9f09981","name":"Haishan Liu","hidden":false},{"_id":"6a069429b1a8cbabc9f09982","name":"Changti Wu","hidden":false},{"_id":"6a069429b1a8cbabc9f09983","name":"Hang Yuan","hidden":false},{"_id":"6a069429b1a8cbabc9f09984","name":"Cong Huang","hidden":false},{"_id":"6a069429b1a8cbabc9f09985","name":"Kai Chen","hidden":false}],"publishedAt":"2026-05-14T00:00:00.000Z","submittedOnDailyAt":"2026-05-15T00:00:00.000Z","title":"IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation","submittedOnDailyBy":{"_id":"65ec01fd770aa0e25d9374dc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/yvLWwBEdAdHb-8EdUHg3n.jpeg","isPro":false,"fullname":"Shijie Lian","user":"LiamLian0727","type":"user","name":"LiamLian0727"},"summary":"Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. 
Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines","upvotes":14,"discussionId":"6a069429b1a8cbabc9f09986","githubRepo":"https://github.com/ZGC-EmbodyAI/IntentVLA","githubRepoAddedBy":"user","ai_summary":"IntentVLA is a history-conditioned visual-language action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations.","ai_keywords":["visual-language action","partial observability","short-horizon intents","history-conditioned","intent representation","rollout stability","ambiguity-aware benchmark","RoboTwin2","AliasBench","frame-conditioned"],"githubStars":0,"organization":{"_id":"6948d884070dda0c2ae35a78","name":"DeepCybo","fullname":"DeepCybo","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/QOsz6P_7AxyqGrjsRHTGk.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"65ec01fd770aa0e25d9374dc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/yvLWwBEdAdHb-8EdUHg3n.jpeg","isPro":false,"fullname":"Shijie Lian","user":"LiamLian0727","type":"user"},{"_id":"691fdd4d36d3f9fad4989b62","avatarUrl":"/avatars/97a1ad76fbe5eccb237418ea5fd9746b.svg","isPro":false,"fullname":"James Liu","user":"ymqn941016","type":"user"},{"_id":"65b8f1e44a5bc2a978100d1e","avatarUrl":"/avatars/1a2c104410445df5edd30c1f41f69a37.svg","isPro":false,"fullname":"Jerry Zhang","user":"Zhang-Jerry","type":"user"},{"_id":"6a012616b9330b623fa5914e","avatarUrl":"/avatars/05af9b38ab3eaa0ed56aab131af316fe.svg","isPro":false,"fullname":"tian","user":"Skylm","type":"user"},{"_id":"666945f71dfc3a62b5fd82c2","avatarUrl":"/avatars/8abc2212d1dec68565bf23e42b1a1abf.svg","isPro":false,"fullname":"Lobster","user":"ldp2211479","type":"user"},{"_id":"63d3b5f1640bb0f77173baea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674819020331-noauth.jpeg","isPro":false,"fullname":"yubin","user":"VLyb","type":"user"},{"_id":"6970c725aa5af823d07e5aa0","avatarUrl":"/avatars/a23801d0ca1f122f1c1a7e3af39b9d1c.svg","isPro":false,"fullname":"haishan","user":"haishan12123","type":"user"},{"_id":"6407e5294edf9f5c4fd32228","avatarUrl":"/avatars/8e2d55460e9fe9c426eb552baf4b2cb0.svg","isPro":false,"fullname":"Stoney Kang","user":"sikang99","type":"user"},{"_id":"670b7f814dcd2ee512c5a86a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/Vct92PMzrIOSWDkHHZ1Bb.png","isPro":false,"fullname":"hence","user":"John952","type":"user"},{"_id":"6903571a021ccd1275410c02","avatarUrl":"/avatars/f12aa39e7c227006b3e437eeb8d02c96.svg","isPro":false,"fullname":"lianghaozhe","user":"lianghaozhe","type":"user"},{"_id":"67b55e66d454cc4d10d21cfd","avatarUrl":"/avatars/3b18014fa7e603a5940175896f89372a.svg","isPro":false,"fullname":"Changti Wu","user":"MaplesWCT","type":"user"},{"_id":"64049e1d0ab5e22719f37be2","avatarUrl":"/avatars/1902cd469d231dc252ac92d33af8f5b2.svg","isPro":false,"fullname":"Mingrui 
Chen","user":"Walnutes","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"6948d884070dda0c2ae35a78","name":"DeepCybo","fullname":"DeepCybo","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/QOsz6P_7AxyqGrjsRHTGk.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2605/2605.14712.md"}">
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen
Abstract
AI-generated summary: IntentVLA is a history-conditioned visual-language action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations.
Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.
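To make the history-conditioning idea concrete, here is a minimal PyTorch sketch of a policy that pools a short window of past frame features into a compact intent vector and uses it, together with the current observation and instruction, to generate an action chunk. This is only an illustration of the idea described in the abstract, not the IntentVLA architecture: the GRU history encoder, the MLP chunk head, and all dimensions are assumptions.

```python
# Illustrative sketch (not the authors' implementation) of a history-conditioned
# policy: recent frame features are compressed into a short-horizon "intent"
# vector that conditions action-chunk generation. Module choices and sizes are
# assumptions, not taken from the paper.
import torch
import torch.nn as nn


class IntentConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=512, lang_dim=512, intent_dim=64,
                 action_dim=7, chunk_len=8):
        super().__init__()
        # Encode the recent observation history into one compact intent vector.
        self.history_encoder = nn.GRU(obs_dim, intent_dim, batch_first=True)
        # Map (current obs, instruction, intent) to a chunk of future actions.
        self.chunk_head = nn.Sequential(
            nn.Linear(obs_dim + lang_dim + intent_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, obs_history, obs_now, lang):
        # obs_history: (B, T, obs_dim) features of the last T frames
        # obs_now:     (B, obs_dim)    current frame feature
        # lang:        (B, lang_dim)   instruction embedding
        _, h = self.history_encoder(obs_history)       # h: (1, B, intent_dim)
        intent = h.squeeze(0)                          # (B, intent_dim)
        x = torch.cat([obs_now, lang, intent], dim=-1)
        chunk = self.chunk_head(x)                     # (B, chunk_len * action_dim)
        return chunk.view(-1, self.chunk_len, self.action_dim)


# Usage: each replanning step conditions the new chunk on the same intent
# summary of recent frames, rather than on the current frame alone.
policy = IntentConditionedPolicy()
chunk = policy(torch.randn(2, 4, 512), torch.randn(2, 512), torch.randn(2, 512))
print(chunk.shape)  # torch.Size([2, 8, 7])
```

The design point the sketch highlights is that the intent vector carries information the current frame may not disambiguate (e.g., which sub-goal the demonstrator was pursuing), so adjacent chunks are less likely to be generated under conflicting intents.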