Hugging Face Daily Papers · June 5, 2026 · 4 min read

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

AffordanceVLA introduces a structured affordance-forecasting bridge for VLA models, enabling robots to reason about what to manipulate, where to interact, and how to act for more robust instruction-following manipulation.

arxiv:2606.06155

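Published on Jun 4, 2026 · Submitted by Qize Yu on Jun 5, 2026 · Peking University

Authors: Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen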

Abstract

AffordanceVLA introduces a unified framework that uses structured affordance forecasting as an intermediate representation to improve the precision of perception-action mapping in robotic manipulation by leveraging vision-language models.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and in the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.
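The abstract names three kinds of affordance cue but gives no implementation details, so the following is only a minimal sketch of how such cues might be represented and chained. Everything in it is an assumption made for illustration: the `AffordanceCues` container, the function names, taking the peak of a 2D affordance map as the interaction pixel, and lifting that pixel to 3D with a standard pinhole back-projection. None of this is taken from the AffordanceVLA code, and the actual model predicts these quantities with learned experts rather than hand-written rules.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AffordanceCues:
    """Hypothetical container for the three cue types named in the abstract."""
    target_object: str                             # stand-in for Which2Act grounding
    contact_pixel: Tuple[int, int]                 # stand-in for Where2Act: (u, v) on the image
    contact_point_3d: Tuple[float, float, float]   # stand-in for How2Act: camera-frame point


def peak_pixel(affordance_map: List[List[float]]) -> Tuple[int, int]:
    """Return (u, v) = (column, row) of the highest-scoring cell in a 2D map."""
    best_u, best_v, best_score = 0, 0, float("-inf")
    for v, row in enumerate(affordance_map):
        for u, score in enumerate(row):
            if score > best_score:
                best_u, best_v, best_score = u, v, score
    return best_u, best_v


def back_project(u: int, v: int, depth: float,
                 fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole back-projection of pixel (u, v) at a known depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def forecast_cues(target_object: str,
                  affordance_map: List[List[float]],
                  depth_map: List[List[float]],
                  intrinsics: Tuple[float, float, float, float]) -> AffordanceCues:
    """Chain the toy stages: which object, where on the image, where in 3D."""
    u, v = peak_pixel(affordance_map)
    fx, fy, cx, cy = intrinsics
    point = back_project(u, v, depth_map[v][u], fx, fy, cx, cy)
    return AffordanceCues(target_object, (u, v), point)


if __name__ == "__main__":
    # Toy 2x3 inputs; the values are made up for illustration only.
    toy_map = [[0.1, 0.2, 0.9],
               [0.0, 0.3, 0.3]]
    toy_depth = [[1.0, 1.0, 2.0],
                 [1.0, 1.0, 1.0]]
    print(forecast_cues("mug", toy_map, toy_depth, (2.0, 2.0, 1.0, 0.0)))
```

A downstream policy could then condition on a structure like `AffordanceCues` instead of raw VLM features, which is the general role the abstract assigns to affordance forecasting as an intermediate representation.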

arXiv: arxiv.org/abs/2606.06155 · Project page: https://skywalker-yqz.github.io/AffordanceVLA/ · GitHub: https://github.com/Skywalker-yqz/AffordanceVLA

Community

Skywalker0410 (paper author, paper submitter) · about 10 hours ago

hellouniverse · about 10 hours ago

Excellent work with plenty of insightful takeaways.

Get this paper in your agent:

hf papers read 2606.06155

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models, datasets, Spaces, and collections citing this paper: none yet. Cite arxiv.org/abs/2606.06155 in a model, dataset, or Space README.md to link it from the paper page.
