Hugging Face Daily Papers · May 18, 2026 · 3 min read

Unlocking Dense Metric Depth Estimation in VLMs

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.15876

Published on May 15 · Submitted by Hanxun Yu on May 18 · Tencent Hunyuan

Authors: Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke

Abstract

AI-generated summary: DepthVLM enhances Vision-Language Models with dense geometry prediction through a lightweight depth head and unified vision-text supervision, achieving superior 3D spatial reasoning while maintaining multimodal capabilities.

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
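To make the "lightweight depth head" idea concrete, here is a minimal sketch of how per-token hidden states could be turned into a dense depth map. It is not the authors' implementation: the abstract does not describe the head's architecture, so the single linear layer, the log-depth parameterization, the bilinear upsampling, and the names `depth_head` and `bilinear_resize` are all illustrative assumptions.

```python
import math
import random


def bilinear_resize(grid, out_hw):
    """Bilinearly resize a 2D list of floats to (H, W), align-corners style."""
    h, w = len(grid), len(grid[0])
    H, W = out_hw
    out = []
    for i in range(H):
        # Map the output row to a fractional source row.
        y = i * (h - 1) / (H - 1) if H > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for j in range(W):
            x = j * (w - 1) / (W - 1) if W > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def depth_head(hidden, weights, bias, grid_hw, out_hw):
    """Hypothetical depth head: visual-token hidden states -> dense depth map.

    hidden:  h*w token vectors (lists of d floats) in row-major order,
             standing in for the backbone's outputs at visual positions.
    weights, bias: parameters of one linear layer (an assumed design).
    grid_hw: (h, w) token grid; out_hw: (H, W) output resolution.
    """
    h, w = grid_hw
    if len(hidden) != h * w:
        raise ValueError("expected one hidden vector per grid cell")
    # Project each token to a scalar log-depth, then exponentiate so the
    # predicted depth is strictly positive.
    coarse = [
        [
            math.exp(sum(x * wt for x, wt in zip(hidden[r * w + c], weights)) + bias)
            for c in range(w)
        ]
        for r in range(h)
    ]
    # Upsample the coarse token-level map to the requested resolution.
    return bilinear_resize(coarse, out_hw)


# Toy usage with random stand-in hidden states (no real model involved).
random.seed(0)
h, w, d = 2, 3, 4
hidden = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(h * w)]
depth = depth_head(hidden, [0.1] * d, bias=0.0, grid_hw=(h, w), out_hw=(4, 6))
print(len(depth), len(depth[0]))
```

In this sketch the depth map comes from the same hidden states that would feed the language output, which is one way to read the abstract's "single forward pass" claim; a real head would likely be learned jointly and be more elaborate than a single projection.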

Project page: https://depthvlm.github.io/ · GitHub: https://github.com/hanxunyu/DepthVLM

Community

JonnyYu828 (paper author, paper submitter) · about 23 hours ago · edited about 12 hours ago

DepthVLM serves as a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, while achieving substantially faster inference compared with existing VLM-based approaches such as DepthLM and Youtu-VL.
