Towards Consistent Video Geometry Estimation
Published on May 28, 2026
· Submitted by zhu on May 29
Authors: Zhu Yu, Jingnan Gao, Runmin Zhang, Lingteng Qiu, Zhengyi Zhao, Rui Peng, Yichao Yan, Kejie Qiu, Siyu Zhu, Si-Yuan Cao, Hui-Liang Shen
Project page: https://pkqbajng.github.io/ViGeo/
Code: https://github.com/aigc3d/ViGeo
arXiv: arxiv.org/abs/2605.30060
Abstract
AI-generated summary: ViGeo is a transformer-based foundation model that recovers dense and consistent 3D geometry from videos using dynamic chunking attention and a completion-based data refinement framework.
This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
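The abstract does not spell out how dynamic chunking attention builds its attention pattern, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a frame-level, block-causal mask: frames inside a chunk attend to each other bidirectionally, and each chunk also sees all earlier chunks. Under that assumption, a chunk size of 1 gives a purely causal (streaming) pattern and a chunk size equal to the sequence length gives full bidirectional attention. The names `chunked_attention_mask` and `sample_chunk_size`, and the uniform sampling schedule, are hypothetical.

```python
import random


def chunked_attention_mask(num_frames, chunk_size):
    """Return mask[q][k] == True when query frame q may attend to key frame k.

    Assumption (not from the paper): frames are grouped into consecutive
    chunks of `chunk_size`. Attention is bidirectional inside a chunk and
    causal across chunks, so a frame sees every earlier chunk but no later one.
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")
    # Index of the chunk that each frame belongs to.
    chunk_of = [i // chunk_size for i in range(num_frames)]
    return [
        [chunk_of[k] <= chunk_of[q] for k in range(num_frames)]
        for q in range(num_frames)
    ]


def sample_chunk_size(num_frames, rng=random):
    """Hypothetical training-time schedule: draw a chunk size uniformly.

    Varying the chunk size per training sample is one way a single model
    could be exposed to both causal and bidirectional temporal contexts.
    """
    return rng.randint(1, num_frames)


if __name__ == "__main__":
    # Print the mask for a 4-frame clip at three chunk sizes.
    for size in (1, 2, 4):
        print(f"chunk_size={size}")
        for row in chunked_attention_mask(4, size):
            print("".join("x" if allowed else "." for allowed in row))
```

In this sketch the test-time mode is chosen purely by the `chunk_size` argument, which mirrors the abstract's claim that the attention pattern can be adapted without retraining. A real model would expand the frame-level mask to token level before applying it inside attention.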
Community
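This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
* Stabilizing Streaming Video Geometry via Dynamic Feature Normalization (https://huggingface.co/papers/2605.25308) (2026)
* Geometric Context Transformer for Streaming 3D Reconstruction (https://huggingface.co/papers/2604.14141) (2026)
* Large Depth Completion Model from Sparse Observations (https://huggingface.co/papers/2605.30115) (2026)
* Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation (https://huggingface.co/papers/2604.21713) (2026)
* GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth (https://huggingface.co/papers/2605.10525) (2026)
* VDPP: Video Depth Post-Processing for Speed and Scalability (https://huggingface.co/papers/2604.06665) (2026)
* Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction (https://huggingface.co/papers/2604.08542) (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend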