Hugging Face Daily Papers · 6 min read

Towards Consistent Video Geometry Estimation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://pkqbajng.github.io/ViGeo/
Code: https://github.com/aigc3d/ViGeo
Organization: Zhejiang University
arxiv:2605.30060

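Published on May 28, 2026 · Submitted by zhu (pkqbajng) on May 29, 2026
Authors: Zhu Yu, Jingnan Gao, Runmin Zhang, Lingteng Qiu, Zhengyi Zhao, Rui Peng, Yichao Yan, Kejie Qiu, Siyu Zhu, Si-Yuan Cao, Hui-Liang Shen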

Abstract

AI-generated summary: ViGeo is a transformer-based foundation model that recovers dense and consistent 3D geometry from videos using dynamic chunking attention and a completion-based data refinement framework.

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
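
The abstract does not spell out how dynamic chunking attention builds its masks, so the following is only a minimal sketch of one plausible reading: frames are grouped into chunks, attention is bidirectional inside a chunk and causal across chunks, and the chunk size is varied during training. The function names `chunked_attention_mask` and `sample_chunk_size`, and the uniform sampling of chunk sizes, are assumptions made for illustration and are not taken from the paper or its code.

```python
import random


def chunked_attention_mask(num_frames, chunk_size):
    """Return a frame-level mask; mask[q][k] is True if query frame q may attend to key frame k.

    Illustrative assumption, not the paper's implementation:
    frames in the same chunk see each other (bidirectional), and every
    frame also sees all frames in earlier chunks (causal across chunks).
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    chunk_of = [frame // chunk_size for frame in range(num_frames)]
    return [
        [chunk_of[k] <= chunk_of[q] for k in range(num_frames)]
        for q in range(num_frames)
    ]


def sample_chunk_size(num_frames, rng):
    """Hypothetical training-time sampler: choose a chunk size uniformly in 1..num_frames."""
    return rng.randint(1, num_frames)


# chunk_size == 1 gives a purely causal mask (streaming-style inference).
streaming = chunked_attention_mask(4, 1)

# chunk_size == num_frames gives a fully bidirectional mask (full-sequence inference).
offline = chunked_attention_mask(4, 4)

# Intermediate sizes mix both: bidirectional within a chunk, causal between chunks.
chunked = chunked_attention_mask(4, 2)

# During training, a different chunk size could be drawn for each sample.
training_mask = chunked_attention_mask(4, sample_chunk_size(4, random.Random(0)))
```

Under this reading, switching between streaming and full-sequence inference only changes the `chunk_size` argument, which is consistent with the abstract's claim that the attention pattern can be adapted at test time without retraining.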

Community


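Automated comment from Librarian Bot (https://huggingface.co/librarian-bots): the following similar papers were recommended by the Semantic Scholar API.

- Stabilizing Streaming Video Geometry via Dynamic Feature Normalization (2026): https://huggingface.co/papers/2605.25308
- Geometric Context Transformer for Streaming 3D Reconstruction (2026): https://huggingface.co/papers/2604.14141
- Large Depth Completion Model from Sparse Observations (2026): https://huggingface.co/papers/2605.30115
- Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation (2026): https://huggingface.co/papers/2604.21713
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth (2026): https://huggingface.co/papers/2605.10525
- VDPP: Video Depth Post-Processing for Speed and Scalability (2026): https://huggingface.co/papers/2604.06665
- Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction (2026): https://huggingface.co/papers/2604.08542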



Get this paper in your agent:

hf papers read 2605.30060
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 2

