Hugging Face Daily Papers · June 2, 2026 · 7 min read

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Humans glance at any object and instantly know where to act and how. AFUN is an affordance foundation model that aims to give robots the same ability. From a single RGB-D image and a language command, it jointly predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to move). It is trained on one of the largest real-world affordance datasets to date, spanning robot, human, simulation, and 3D-scan sources. It takes a step toward open-world generalization, with state-of-the-art results across segmentation, contact-point, and 3D-motion benchmarks. It also deploys directly to a real robot, executing manipulation tasks such as opening and grasping without any robot-specific fine-tuning.
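To make the "where" and "how" outputs concrete, here is a minimal sketch of how a downstream consumer might turn a predicted mask and a depth image into a 3D contact point, and then place a motion curve relative to it. Nothing here is taken from the AFUN code: the function names, the pinhole-camera intrinsics (`fx`, `fy`, `cx`, `cy`), the centroid-based point selection, and the idea that the curve arrives as offsets from the contact point are all assumptions made for illustration.

```python
import numpy as np


def mask_to_contact_point(mask, depth, fx, fy, cx, cy):
    """Back-project one masked pixel to a 3D point in the camera frame.

    Hypothetical helper: assumes a pinhole camera and depth in metres,
    with zero depth meaning "no measurement".
    """
    mask = np.asarray(mask, dtype=bool)
    depth = np.asarray(depth, dtype=float)
    valid = mask & (depth > 0)
    rows, cols = np.nonzero(valid)
    if rows.size == 0:
        raise ValueError("mask has no pixels with valid depth")
    # Use the valid masked pixel nearest the centroid, so the point lies on the mask.
    r0, c0 = rows.mean(), cols.mean()
    i = np.argmin((rows - r0) ** 2 + (cols - c0) ** 2)
    v, u = rows[i], cols[i]
    z = depth[v, u]
    # Standard pinhole back-projection from pixel (u, v) and depth z.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])


def curve_waypoints(contact, offsets):
    """Place a motion curve given as 3D offsets relative to the contact point.

    Assumption: the curve is expressed relative to the contact point.
    """
    return np.asarray(contact, dtype=float) + np.asarray(offsets, dtype=float)
```

A caller would pass the predicted mask, the aligned depth image, and the camera intrinsics, then feed the resulting waypoints to whatever motion planner the robot uses.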

arxiv:2606.02551

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Published on Jun 1 · Submitted by Zhaoning Wang on Jun 2

University of Michigan

Authors: Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, Jun Gao

Abstract

Affordance understanding model predicts functional masks and 3D motion curves from RGB-D observations and language descriptions, enabling generalizable robot manipulation across diverse environments.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a 12.7–61.3% hit-rate gain over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
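The abstract reports gIoU, cIoU, and a contact-point hit rate without defining them. The sketch below shows one common reading, under stated assumptions: gIoU as the mean of per-image IoUs and cIoU as cumulative intersection over cumulative union (the convention used in several language-guided segmentation benchmarks), and a "hit" as a predicted point that falls inside the ground-truth mask. The paper's exact protocol may differ, and the function names are hypothetical.

```python
import numpy as np


def giou_ciou(preds, gts):
    """Return (gIoU, cIoU) for paired boolean masks.

    Assumed definitions: gIoU averages per-image IoU; cIoU divides the
    summed intersections by the summed unions over the whole test set.
    """
    ious = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        # Assumption: an empty prediction on an empty target is a perfect match.
        ious.append(1.0 if union == 0 else inter / union)
        total_inter += inter
        total_union += union
    giou = float(np.mean(ious))
    ciou = 1.0 if total_union == 0 else total_inter / total_union
    return giou, ciou


def hit_rate(points, gts):
    """Fraction of predicted (row, col) points that land inside the GT mask."""
    hits = [bool(np.asarray(gt, dtype=bool)[r, c]) for (r, c), gt in zip(points, gts)]
    return sum(hits) / len(hits)
```

The two IoU variants weight data differently: gIoU treats every image equally, while cIoU is dominated by images with large masks, which is one reason benchmarks often report both.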

arXiv: https://arxiv.org/abs/2606.02551 · GitHub: https://github.com/EricWang12/AFUN

Community

Zhaoningw

Paper submitter about 9 hours ago

librarian-bot

1 minute ago

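This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:

* BridgeACT: Bridging Human Demonstrations to Robot Actions via Unified Tool-Target Affordances (2026), https://huggingface.co/papers/2604.23249
* Afford-VLA: Action-Aligned Visual Planning via Internalized Affordance (2026), https://huggingface.co/papers/2605.24203
* Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation (2026), https://huggingface.co/papers/2604.24681
* Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments (2026), https://huggingface.co/papers/2605.30280
* Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation (2026), https://huggingface.co/papers/2605.20085
* Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations (2026), https://huggingface.co/papers/2604.07517
* AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation (2026), https://huggingface.co/papers/2604.11674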

Get this paper in your agent:

hf papers read 2606.02551

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
