Hugging Face Daily Papers · 5 min read

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://danielchyeh.github.io/GASP/
arxiv:2605.30231

Published on May 28, 2026 · Submitted by Chun-Hsiao Yeh on May 29, 2026
Authors: Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao

Abstract

AI-generated summary: Training Vision-Language Models with geometric priors improves 3D spatial reasoning through deep supervision with contrastive loss and depth consistency, achieving better performance than standard fine-tuning approaches.

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
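
To make the correspondence objective in the abstract concrete, here is a minimal sketch. The abstract does not give the exact loss, so this assumes an InfoNCE-style contrastive loss over cosine similarities, where row i of one view is the ground-truth match for row i of the other. The function names (`info_nce_loss`, `matching_accuracy`), the temperature of 0.07, and the toy 2-D features are all hypothetical. It is not the authors' implementation, and the depth consistency term is not covered.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(feats_a, feats_b):
    """Pairwise cosine similarities: rows index view A, columns index view B."""
    return [[cosine_similarity(fa, fb) for fb in feats_b] for fa in feats_a]

def info_nce_loss(feats_a, feats_b, temperature=0.07):
    """Mean InfoNCE loss, assuming feats_a[i] corresponds to feats_b[i].

    The temperature is an illustrative assumption, not a value from the paper.
    """
    sims = similarity_matrix(feats_a, feats_b)
    total = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_sum_exp - logits[i]  # negative log-softmax of the true match
    return total / len(sims)

def matching_accuracy(feats_a, feats_b):
    """Fraction of points in view A whose nearest neighbour in view B is the true match."""
    sims = similarity_matrix(feats_a, feats_b)
    hits = 0
    for i, row in enumerate(sims):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sims)

# Toy features for three points seen from two views (hypothetical data).
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]        # view-invariant features
shuffled_b = [aligned_b[1], aligned_b[2], aligned_b[0]]   # features that mismatch

print("aligned :", info_nce_loss(view_a, aligned_b), matching_accuracy(view_a, aligned_b))
print("shuffled:", info_nce_loss(view_a, shuffled_b), matching_accuracy(view_a, shuffled_b))
```

In this toy setup the aligned features get a lower loss and a higher matching accuracy than the shuffled ones. A nearest-neighbour accuracy of this kind is one plausible reading of the layer-wise "correspondence matching accuracy" the abstract reports, computed here on made-up vectors instead of transformer hidden states.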

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

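The following papers were recommended by the Semantic Scholar API:

- Unlocking Dense Metric Depth Estimation in VLMs (2026), https://huggingface.co/papers/2605.15876
- Let Geometry GUIDE: Layer-wise Unrolling of Geometric Priors in Multimodal LLMs (2026), https://huggingface.co/papers/2604.05695
- Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding (2026), https://huggingface.co/papers/2604.02546
- 3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding (2026), https://huggingface.co/papers/2605.29416
- Dual-Pathway Geometry-Aware MLLM for Spatial Intelligence (2026), https://huggingface.co/papers/2605.25334
- VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models (2026), https://huggingface.co/papers/2605.10485
- SpaceMind++: Toward Allocentric Cognitive Maps for Spatially Grounded Video MLLMs (2026), https://huggingface.co/papers/2605.09449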

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Get this paper in your agent:

hf papers read 2605.30231
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
