Hugging Face Daily Papers · 5 min read

MAOAM: Unified Object and Material Selection with Vision-Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.04880

Published on Jun 2, 2026 · Submitted by Jaden Park on Jun 5, 2026
Authors: Jaden Park, Valentin Deschaintre, Jason Kuen, Kangning Liu, Iliyan Georgiev, Krishna Kumar Singh, Yong Jae Lee, Michael Fischer
Organization: Adobe Research
Project page: https://jadenpark0.github.io/project_pages/maoam/
Code: https://github.com/adobe-research/obj-and-mat-selection

Abstract

A unified vision-language model framework enables precise object and material selection through text or click interactions, supporting diverse editing workflows with improved robustness.

Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text- or click-based interactions, and the system should support selecting not only objects but also other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language model (VLM) based selection methods are object-centric and typically support a single interaction modality, limiting their applicability. In this work, we thus present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object- and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object- or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual semantics. We train MAOAM with a multi-task objective over click- and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts, our model exhibits an emergent improvement in selection when combining text and clicks at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.
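
Two ideas in the abstract lend themselves to a small illustration: a segmentation head that decodes one output token into a mask, and a multi-task objective over click selection, text selection, and VQA. The sketch below is not the authors' implementation. It assumes that the mask logit at each pixel is the dot product between the token embedding and that pixel's feature vector (a common design in VLM-based segmentation), and that the objective is a weighted sum of per-task losses. The names `decode_mask` and `multi_task_loss`, the threshold, and the loss weights are all hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_mask(token: Vector,
                pixel_features: List[List[Vector]],
                threshold: float = 0.5) -> List[List[int]]:
    """Turn one output-token embedding into a binary mask.

    Assumption (not confirmed by the source): every pixel carries a feature
    vector with the same length as the token, and the mask logit is their
    dot product.
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            logit = sum(t * f for t, f in zip(token, feat))
            mask_row.append(1 if sigmoid(logit) > threshold else 0)
        mask.append(mask_row)
    return mask


def multi_task_loss(losses: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Weighted sum of per-task losses; the weights are hypothetical."""
    return sum(weights[name] * value for name, value in losses.items())


# Toy 2x2 "image" with 2-dimensional pixel features.
token = [1.0, -1.0]
features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 1.0], [3.0, 1.0]],
]
print(decode_mask(token, features))  # → [[1, 0], [0, 1]]

# Hypothetical per-task losses for one training step.
step_losses = {"click": 0.4, "text": 0.6, "vqa": 1.0}
step_weights = {"click": 1.0, "text": 1.0, "vqa": 0.5}
print(multi_task_loss(step_losses, step_weights))
```

In this toy setup, a click prompt and a text prompt would each yield their own output token, and both would pass through the same `decode_mask` step. Whether MAOAM shares the head across modalities in this way is an assumption here.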


Get this paper in your agent:

hf papers read 2606.04880
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
