Hugging Face Daily Papers · June 24, 2026 · 5 min read

InSight: Self-Guided Skill Acquisition via Steerable VLAs

#model-release #multimodal #acquisition #robotics

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Vision-language-action (VLA) models can learn manipulation skills from demonstrations, but their capabilities are bounded by the skills in the training data. We present InSight, a framework that unlocks autonomous skill acquisition by rendering VLAs steerable at the primitive-action level (e.g., "move gripper to the bowl", "lift upward", "pour the bottle"). InSight consists of two primary stages: (1) an automated segmentation pipeline that partitions demonstrations into labeled primitives via VLM plan decomposition and end-effector poses to enable VLA primitive steerability, and (2) a VLM-guided data flywheel that identifies missing primitives required to accomplish a novel task, autonomously attempts demonstrations of the missing primitives with VLM-proposed low-level control, and automatically labels, stores, and integrates successful demonstrations into the VLA training set. We evaluate InSight across simulation and real-world manipulation tasks, including block flipping, drawer closing, sweeping, twisting, and pouring, without any human demonstrations of these target skills. Once learned, these primitives can be composed to execute novel, long-horizon tasks without additional human demonstrations. Our findings demonstrate that primitive steerability provides a practical foundation for continual skill acquisition in VLA policies.
Project website: https://insight-vla.github.io/
Code: https://github.com/insight-vla/insight
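The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`Segment`, `segment_by_plan`, `SkillLibrary`, `flywheel_step`, the `attempt` callback) is hypothetical, poses are reduced to xyz tuples, and the VLM plan and the pose-derived cut points are passed in as plain arguments rather than computed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Pose = Tuple[float, float, float]  # assumption: end-effector position only


@dataclass
class Segment:
    label: str          # primitive name, e.g. "lift upward"
    poses: List[Pose]   # end-effector poses covered by this primitive


def segment_by_plan(poses: Sequence[Pose], plan: Sequence[str],
                    boundaries: Sequence[int]) -> List[Segment]:
    """Stage 1 (toy): split one demonstration into labeled primitives.

    `plan` stands in for a VLM plan decomposition and `boundaries` for cut
    indices derived from end-effector poses; both are assumed to be given.
    """
    if len(boundaries) != len(plan) - 1:
        raise ValueError("need one boundary between consecutive primitives")
    cuts = [0, *boundaries, len(poses)]
    return [Segment(label, list(poses[a:b]))
            for label, a, b in zip(plan, cuts, cuts[1:])]


@dataclass
class SkillLibrary:
    """Stand-in for the VLA training set, keyed by primitive label."""
    demos: Dict[str, List[Segment]] = field(default_factory=dict)

    def add(self, seg: Segment) -> None:
        self.demos.setdefault(seg.label, []).append(seg)

    def missing(self, required: Sequence[str]) -> List[str]:
        # Primitives the new task needs but the library has no data for.
        return [p for p in dict.fromkeys(required) if p not in self.demos]


def flywheel_step(library: SkillLibrary, required: Sequence[str],
                  attempt: Callable[[str], Optional[List[Pose]]],
                  max_tries: int = 3) -> List[str]:
    """Stage 2 (toy): attempt each missing primitive and store successes.

    `attempt` stands in for VLM-proposed low-level control; it returns the
    executed poses on success and None on failure.
    """
    learned = []
    for primitive in library.missing(required):
        for _ in range(max_tries):
            poses = attempt(primitive)
            if poses is not None:
                library.add(Segment(primitive, poses))  # auto-label and store
                learned.append(primitive)
                break
    return learned
```

In this sketch, retraining the policy on the enlarged library is left out; the point is only the loop structure: compare required primitives against known ones, attempt the gap, and keep what succeeds.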

arxiv:2606.24884

Published on Jun 23

Submitted by Maggie (maggi3wang) on Jun 24

Stanford University

Authors: Maggie Wang, Lars Osterberg, Stephen Tian, Ola Shorinwa, Jiajun Wu, Mac Schwager

AI-generated summary

InSight enables autonomous skill acquisition for vision-language-action models through primitive-action level steerability and automated demonstration generation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Get this paper in your agent:

hf papers read 2606.24884

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
