AFUN: Towards an Affordance Foundation Model for Functionality Understanding
Authors: Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, Jun Gao
Organization: University of Michigan
Published: 2026-06-01 (arXiv: 2606.02551)
Code: https://github.com/EricWang12/AFUN
Abstract
An affordance understanding model predicts functional masks and 3D motion curves from RGB-D observations and language descriptions, enabling generalizable robot manipulation across diverse environments.
Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet building an affordance foundation model that not only understands where and how an interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge: they either localize task-relevant regions without specifying executable motion, or predict motion but with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN on three aspects. For affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3. For contact-point prediction, it predicts substantially more accurate points, with a 12.7-61.3% hit-rate gain over the best baseline. For 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without fine-tuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
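The abstract reports gIoU, cIoU, and a contact-point hit rate without defining them here. The sketch below shows how such metrics are commonly computed, under stated assumptions: gIoU is taken as the mean of per-image IoUs and cIoU as cumulative intersection over cumulative union (the usual convention in referring and reasoning segmentation), and a contact point counts as a hit when it falls inside the ground-truth mask. These definitions, the function names, and the toy masks are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # 2D grid of 0/1 values, row-major


def _intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two same-sized binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def giou_ciou(preds: List[Mask], gts: List[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) under the assumed definitions.

    gIoU: mean of per-image IoU. cIoU: summed intersection / summed union.
    An image with an empty union is scored as IoU 1.0 (assumed convention).
    """
    per_image = []
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = _intersection_union(pred, gt)
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


def hit_rate(points: List[Tuple[int, int]], gts: List[Mask]) -> float:
    """Fraction of predicted (row, col) contact points that land inside the GT mask."""
    hits = sum(1 for (r, c), gt in zip(points, gts) if gt[r][c])
    return hits / len(points)


# Toy data: two 2x2 images (hypothetical, for illustration only).
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
giou, ciou = giou_ciou(preds, gts)
rate = hit_rate([(0, 0), (1, 1)], gts)
print(giou, ciou, rate)
```

The two IoU variants weight images differently: gIoU treats every image equally, while cIoU is dominated by images with large masks, which is one reason papers often report both. The functions return fractions, whereas the gains quoted in the abstract are in percentage points.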
Community
Humans glance at any object and instantly know where to act and how. AFUN is an affordance foundation model that gives robots the same ability. From a single RGB-D image and a language command, it jointly predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to move). It is trained on one of the largest real-world affordance datasets to date, spanning robot, human, simulation, and 3D-scan sources. It takes a step toward open-world generalization, with state-of-the-art results across segmentation, contact-point, and 3D-motion benchmarks. It also deploys directly to a real robot, executing manipulation tasks like opening and grasping without any robot-specific fine-tuning.
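This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
* BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances (2026): https://huggingface.co/papers/2604.23249
* Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance (2026): https://huggingface.co/papers/2605.24203
* Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation (2026): https://huggingface.co/papers/2604.24681
* Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments (2026): https://huggingface.co/papers/2605.30280
* Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation (2026): https://huggingface.co/papers/2605.20085
* Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations (2026): https://huggingface.co/papers/2604.07517
* AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation (2026): https://huggingface.co/papers/2604.11674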