Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
Authors: Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal
Published: 2026-05-28
arXiv: 2605.30557
Project page: https://zhangyuejoslin.github.io/spatialuncertain/
Code: https://github.com/zhangyuejoslin/SpatialUncertain_code
Abstract
AI-generated summary: Vision-language models exhibit overconfidence in spatial reasoning tasks and struggle to identify when additional observations are needed to resolve uncertainty.
Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable. They focus on whether models produce correct answers rather than on whether models recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering: they attempt to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which views would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
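To make the abstention-based evaluation described in the abstract concrete, here is a minimal Python sketch of how per-condition accuracy and an overconfident-answering rate might be computed. It is not the authors' implementation. The `Item` record, the `ABSTAIN` label, the condition names, and both function names are hypothetical, and the sketch assumes model outputs have already been normalized to option labels.

```python
from dataclasses import dataclass

# Hypothetical label for "this question cannot be answered from the image".
ABSTAIN = "cannot be determined"


@dataclass
class Item:
    condition: str   # assumed names: "clean", "occlusion", "perspective"
    gold: str        # expected answer; ABSTAIN for challenged observations
    prediction: str  # model output, assumed already mapped to an option label


def accuracy_by_condition(items):
    """Return the fraction of exact matches with the gold label per condition."""
    totals, correct = {}, {}
    for it in items:
        totals[it.condition] = totals.get(it.condition, 0) + 1
        correct[it.condition] = correct.get(it.condition, 0) + int(it.prediction == it.gold)
    return {cond: correct[cond] / totals[cond] for cond in totals}


def overconfidence_rate(items):
    """Return the share of must-abstain items where the model answered anyway."""
    challenged = [it for it in items if it.gold == ABSTAIN]
    if not challenged:
        return 0.0
    answered = sum(it.prediction != ABSTAIN for it in challenged)
    return answered / len(challenged)


if __name__ == "__main__":
    # Toy records for illustration only; these are not results from the paper.
    demo = [
        Item("clean", "left", "left"),
        Item("occlusion", ABSTAIN, "left"),
        Item("occlusion", ABSTAIN, ABSTAIN),
        Item("perspective", ABSTAIN, "taller"),
    ]
    print(accuracy_by_condition(demo))
    print(overconfidence_rate(demo))
```

Under this framing, a model that always commits to an answer can score well on clean observations while scoring zero on the challenged conditions, which is the gap the abstract highlights.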
Community
A controlled framework for evaluating whether VLMs know when not to answer spatial questions.