From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
Abstract
AI-generated summary
CAFE is a new benchmark for evaluating concept-faithful segmentation in promptable models through attribute-level counterfactual manipulation, revealing that accurate mask prediction does not guarantee semantic grounding.
Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving it unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE: Counterfactual Attribute Factuality Evaluation, a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate models of various types and sizes on CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.
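A minimal sketch of how the paired evaluation described above might look in practice, assuming a generic `segment(image, prompt)` callable standing in for any promptable segmentation model (e.g., a SAM3-style interface); the sample field names and the IoU threshold are illustrative assumptions, not the benchmark's official API.

```python
# Sketch of a CAFE-style paired evaluation loop (illustrative, not the
# benchmark's released code). `segment` is a placeholder for a promptable
# segmentation model that maps (image, text prompt) -> boolean mask.
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def evaluate_pair(sample: dict, segment, iou_thresh: float = 0.5) -> dict:
    """Run the model on the positive and the misleading negative prompt.

    A concept-faithful model should localize the target for the positive
    prompt but should not retrieve the same mask for the negative prompt.
    """
    gt = sample["gt_mask"]  # ground-truth boolean mask (assumed field name)
    pos_mask = segment(sample["image"], sample["positive_prompt"])
    neg_mask = segment(sample["image"], sample["negative_prompt"])

    pos_iou = iou(pos_mask, gt)
    neg_iou = iou(neg_mask, gt)
    return {
        "positive_iou": pos_iou,
        "negative_iou": neg_iou,
        # Shortcut behavior: the misleading prompt still yields an accurate mask.
        "shortcut": pos_iou >= iou_thresh and neg_iou >= iou_thresh,
    }
```

Aggregating the `shortcut` flag over all 2,146 pairs (and per SM/CC/OC category) would expose the gap the paper reports between localization quality and concept discrimination.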