InSight: Self-Guided Skill Acquisition via Steerable VLAs
Authors: Maggie Wang, Lars Osterberg, Stephen Tian, Ola Shorinwa, Jiajun Wu, Mac Schwager
Organization: Stanford University
Published on Jun 23, 2026 · Submitted by Maggie (maggi3wang) on Jun 24, 2026
arXiv: 2606.24884
Project website: https://insight-vla.github.io/
Code: https://github.com/insight-vla/insight
Abstract
AI-generated summary: InSight enables autonomous skill acquisition for vision-language-action models through primitive-action level steerability and automated demonstration generation.
Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., "move gripper to the bowl", "lift upward", "pour the bottle"). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings demonstrate that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies. Project website: https://insight-vla.github.io.
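The second stage described above (find the primitives a new task needs but the policy lacks, attempt them, and keep only the successes) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `PrimitiveLibrary`, `flywheel_step`, the `attempt` callback, and the `max_attempts` parameter are all hypothetical names introduced here, and the VLM planner, the low-level controller, and the success check are stubbed out by a plain Python function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical trajectory type: a list of end-effector (x, y, z) positions.
Trajectory = List[Tuple[float, float, float]]


@dataclass
class PrimitiveLibrary:
    """Toy stand-in for the labeled-primitive portion of a VLA training set."""
    demos: Dict[str, List[Trajectory]] = field(default_factory=dict)

    def knows(self, label: str) -> bool:
        # A primitive counts as known once at least one demo is stored for it.
        return bool(self.demos.get(label))

    def add(self, label: str, trajectory: Trajectory) -> None:
        self.demos.setdefault(label, []).append(trajectory)


def flywheel_step(
    plan: List[str],
    library: PrimitiveLibrary,
    attempt: Callable[[str], Optional[Trajectory]],
    max_attempts: int = 3,
) -> List[str]:
    """Attempt every primitive in `plan` that the library lacks.

    `attempt` is an assumed callback that tries one primitive and returns
    a trajectory on success or None on failure. Returns the labels of the
    primitives acquired during this call, in plan order.
    """
    acquired: List[str] = []
    for label in plan:
        if library.knows(label):
            continue  # already covered by existing demonstrations
        for _ in range(max_attempts):
            trajectory = attempt(label)
            if trajectory is not None:
                library.add(label, trajectory)  # label and store the success
                acquired.append(label)
                break
    return acquired


# Usage with a stubbed attempt function (no robot or VLM involved).
library = PrimitiveLibrary()
library.add("move gripper to the bowl", [(0.0, 0.0, 0.0)])
library.add("lift upward", [(0.0, 0.0, 0.1)])

plan = ["move gripper to the bowl", "pour the bottle", "lift upward"]


def fake_attempt(label: str) -> Optional[Trajectory]:
    # Pretend only "pour the bottle" can be demonstrated successfully.
    return [(0.1, 0.2, 0.3)] if label == "pour the bottle" else None


new_primitives = flywheel_step(plan, library, fake_attempt)
print(new_primitives)
```

In a real system the stored trajectories would be folded back into policy training; here the library is only a dictionary, so the sketch shows the control flow of the flywheel and nothing about how the VLA itself is updated.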