Unlocking Dense Metric Depth Estimation in VLMs
Authors: Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke
Organization: Tencent Hunyuan
Published: 2026-05-15 (arXiv:2605.15876)
Project page: https://depthvlm.github.io/
Code: https://github.com/hanxunyu/DepthVLM
Abstract
AI-generated summary: DepthVLM enhances Vision-Language Models with dense geometry prediction through a lightweight depth head and unified vision-text supervision, achieving superior 3D spatial reasoning while maintaining multimodal capabilities.
Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel queries or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
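To make the idea of a depth head on top of backbone hidden states more concrete, here is a minimal toy sketch in plain Python. The abstract does not describe the head's architecture, so everything below is an assumption for illustration only: the function name `depth_head`, the single linear projection, the softplus activation, and the nearest-neighbour upsampling are hypothetical stand-ins, not the authors' design. A real implementation would presumably use a tensor library and a learned upsampling module.

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); its output is never negative."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)


def depth_head(hidden, weight, bias, grid_h, grid_w, scale):
    """Map per-token hidden vectors to a dense depth map (hypothetical head).

    hidden: grid_h * grid_w vectors, one per visual token, in row-major order.
    weight, bias: one linear layer projecting each vector to a scalar.
    scale: integer upsampling factor from the token grid to pixel resolution.
    """
    assert len(hidden) == grid_h * grid_w
    # One coarse depth value per visual token.
    coarse = [
        softplus(sum(w * h for w, h in zip(weight, vec)) + bias)
        for vec in hidden
    ]
    # Nearest-neighbour upsampling to (grid_h * scale) x (grid_w * scale).
    depth = []
    for y in range(grid_h * scale):
        row = [
            coarse[(y // scale) * grid_w + (x // scale)]
            for x in range(grid_w * scale)
        ]
        depth.append(row)
    return depth


# Toy usage: random numbers stand in for the backbone's hidden states.
random.seed(0)
dim, grid_h, grid_w, scale = 8, 2, 3, 4
hidden = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(grid_h * grid_w)]
weight = [random.gauss(0, 1) for _ in range(dim)]
depth_map = depth_head(hidden, weight, bias=0.0, grid_h=grid_h, grid_w=grid_w, scale=scale)
print(len(depth_map), len(depth_map[0]))  # 8 12
```

The point of the sketch is only the data flow: the same hidden states that feed text decoding can also be projected to a per-token scalar and expanded to a pixel grid, so the depth map comes out of the same forward pass rather than from per-pixel queries.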
Community
DepthVLM serves as a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, while achieving substantially faster inference compared with existing VLM-based approaches such as DepthLM and Youtu-VL.