<a href=\"https://cdn-uploads.huggingface.co/production/uploads/65ec01fd770aa0e25d9374dc/V5PW_iwLCeCDL0_LQIbFL.png\" rel=\"nofollow\"><img src=\"https://cdn-uploads.huggingface.co/production/uploads/65ec01fd770aa0e25d9374dc/V5PW_iwLCeCDL0_LQIbFL.png\" alt=\"Fig2_preview\"></a></p>\n","updatedAt":"2026-05-15T03:36:22.674Z","author":{"_id":"65ec01fd770aa0e25d9374dc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/yvLWwBEdAdHb-8EdUHg3n.jpeg","fullname":"Shijie Lian","name":"LiamLian0727","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":11,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.38913339376449585},"editors":["LiamLian0727"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/yvLWwBEdAdHb-8EdUHg3n.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2605.14712","authors":[{"_id":"6a069429b1a8cbabc9f0997b","name":"Shijie Lian","hidden":false},{"_id":"6a069429b1a8cbabc9f0997c","name":"Bin Yu","hidden":false},{"_id":"6a069429b1a8cbabc9f0997d","name":"Xiaopeng Lin","hidden":false},{"_id":"6a069429b1a8cbabc9f0997e","name":"Zhaolong Shen","hidden":false},{"_id":"6a069429b1a8cbabc9f0997f","name":"Laurence Tianruo Yang","hidden":false},{"_id":"6a069429b1a8cbabc9f09980","name":"Yurun Jin","hidden":false},{"_id":"6a069429b1a8cbabc9f09981","name":"Haishan Liu","hidden":false},{"_id":"6a069429b1a8cbabc9f09982","name":"Changti Wu","hidden":false},{"_id":"6a069429b1a8cbabc9f09983","name":"Hang Yuan","hidden":false},{"_id":"6a069429b1a8cbabc9f09984","name":"Cong Huang","hidden":false},{"_id":"6a069429b1a8cbabc9f09985","name":"Kai Chen","hidden":false}],"publishedAt":"2026-05-14T00:00:00.000Z","submittedOnDailyAt":"2026-05-15T00:00:00.000Z","title":"IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation","submittedOnDailyBy":{"_id":"65ec01fd770aa0e25d9374dc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/yvLWwBEdAdHb-8EdUHg3n.jpeg","isPro":false,"fullname":"Shijie Lian","user":"LiamLian0727","type":"user","name":"LiamLian0727"},"summary":"Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. 
Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines","upvotes":14,"discussionId":"6a069429b1a8cbabc9f09986","githubRepo":"https://github.com/ZGC-EmbodyAI/IntentVLA","githubRepoAddedBy":"user","ai_summary":"IntentVLA is a history-conditioned visual-language action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations.","ai_keywords":["visual-language action","partial observability","short-horizon intents","history-conditioned","intent representation","rollout stability","ambiguity-aware benchmark","RoboTwin2","AliasBench","frame-conditioned"],"githubStars":0,"organization":{"_id":"6948d884070dda0c2ae35a78","name":"DeepCybo","fullname":"DeepCybo","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/QOsz6P_7AxyqGrjsRHTGk.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"65ec01fd770aa0e25d9374dc","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/yvLWwBEdAdHb-8EdUHg3n.jpeg","isPro":false,"fullname":"Shijie Lian","user":"LiamLian0727","type":"user"},{"_id":"691fdd4d36d3f9fad4989b62","avatarUrl":"/avatars/97a1ad76fbe5eccb237418ea5fd9746b.svg","isPro":false,"fullname":"James Liu","user":"ymqn941016","type":"user"},{"_id":"65b8f1e44a5bc2a978100d1e","avatarUrl":"/avatars/1a2c104410445df5edd30c1f41f69a37.svg","isPro":false,"fullname":"Jerry Zhang","user":"Zhang-Jerry","type":"user"},{"_id":"6a012616b9330b623fa5914e","avatarUrl":"/avatars/05af9b38ab3eaa0ed56aab131af316fe.svg","isPro":false,"fullname":"tian","user":"Skylm","type":"user"},{"_id":"666945f71dfc3a62b5fd82c2","avatarUrl":"/avatars/8abc2212d1dec68565bf23e42b1a1abf.svg","isPro":false,"fullname":"Lobster","user":"ldp2211479","type":"user"},{"_id":"63d3b5f1640bb0f77173baea","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/1674819020331-noauth.jpeg","isPro":false,"fullname":"yubin","user":"VLyb","type":"user"},{"_id":"6970c725aa5af823d07e5aa0","avatarUrl":"/avatars/a23801d0ca1f122f1c1a7e3af39b9d1c.svg","isPro":false,"fullname":"haishan","user":"haishan12123","type":"user"},{"_id":"6407e5294edf9f5c4fd32228","avatarUrl":"/avatars/8e2d55460e9fe9c426eb552baf4b2cb0.svg","isPro":false,"fullname":"Stoney Kang","user":"sikang99","type":"user"},{"_id":"670b7f814dcd2ee512c5a86a","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/no-auth/Vct92PMzrIOSWDkHHZ1Bb.png","isPro":false,"fullname":"hence","user":"John952","type":"user"},{"_id":"6903571a021ccd1275410c02","avatarUrl":"/avatars/f12aa39e7c227006b3e437eeb8d02c96.svg","isPro":false,"fullname":"lianghaozhe","user":"lianghaozhe","type":"user"},{"_id":"67b55e66d454cc4d10d21cfd","avatarUrl":"/avatars/3b18014fa7e603a5940175896f89372a.svg","isPro":false,"fullname":"Changti Wu","user":"MaplesWCT","type":"user"},{"_id":"64049e1d0ab5e22719f37be2","avatarUrl":"/avatars/1902cd469d231dc252ac92d33af8f5b2.svg","isPro":false,"fullname":"Mingrui 
Chen","user":"Walnutes","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"6948d884070dda0c2ae35a78","name":"DeepCybo","fullname":"DeepCybo","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/65ec01fd770aa0e25d9374dc/QOsz6P_7AxyqGrjsRHTGk.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2605/2605.14712.md"}">
IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Zhaolong Shen, Laurence Tianruo Yang, Yurun Jin, Haishan Liu, Changti Wu, Hang Yuan, Cong Huang, Kai Chen
Abstract
AI-generated summary: IntentVLA is a history-conditioned visual-language action framework that improves robot imitation learning stability by encoding short-horizon intents from visual observations, addressing challenges from partial observability and ambiguous observations.
Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.
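To make the history-conditioning idea concrete, here is a minimal PyTorch sketch of a policy that pools a short window of past frame features into a compact intent vector and uses it, together with the current observation and instruction, to generate an action chunk. This is only an illustration of the idea described in the abstract, not the IntentVLA architecture: the GRU history encoder, the MLP chunk head, and all dimensions are assumptions.

```python
# Illustrative sketch (not the authors' implementation) of a history-conditioned
# policy: recent frame features are compressed into a short-horizon "intent"
# vector that conditions action-chunk generation. Module choices and sizes are
# assumptions, not taken from the paper.
import torch
import torch.nn as nn


class IntentConditionedPolicy(nn.Module):
    def __init__(self, obs_dim=512, lang_dim=512, intent_dim=64,
                 action_dim=7, chunk_len=8):
        super().__init__()
        # Encode the recent observation history into one compact intent vector.
        self.history_encoder = nn.GRU(obs_dim, intent_dim, batch_first=True)
        # Map (current obs, instruction, intent) to a chunk of future actions.
        self.chunk_head = nn.Sequential(
            nn.Linear(obs_dim + lang_dim + intent_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, obs_history, obs_now, lang):
        # obs_history: (B, T, obs_dim) features of the last T frames
        # obs_now:     (B, obs_dim)    current frame feature
        # lang:        (B, lang_dim)   instruction embedding
        _, h = self.history_encoder(obs_history)       # h: (1, B, intent_dim)
        intent = h.squeeze(0)                          # (B, intent_dim)
        x = torch.cat([obs_now, lang, intent], dim=-1)
        chunk = self.chunk_head(x)                     # (B, chunk_len * action_dim)
        return chunk.view(-1, self.chunk_len, self.action_dim)


# Usage: each replanning step conditions the new chunk on the same intent
# summary of recent frames, rather than on the current frame alone.
policy = IntentConditionedPolicy()
chunk = policy(torch.randn(2, 4, 512), torch.randn(2, 512), torch.randn(2, 512))
print(chunk.shape)  # torch.Size([2, 8, 7])
```

The design point the sketch highlights is that the intent vector carries information the current frame may not disambiguate (e.g., which sub-goal the demonstrator was pursuing), so adjacent chunks are less likely to be generated under conflicting intents.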