Hugging Face Daily Papers · 5 min read

UniT: Unified Geometry Learning with Group Autoregressive Transformer

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.21131

Published on May 20, 2026 · Submitted by Haotian Wang on May 21, 2026
Authors: Haotian Wang, Yusong Huang, Zhaonian Kuang, Hongliang Lu, Xinhu Zheng, Meng Yang, Gang Hua

Abstract

AI-generated summary: UniT is a unified feed-forward model for geometry perception built on a Group Autoregressive Transformer. It integrates multiple perception paradigms in one framework, uses a scale-adaptive loss to improve metric-scale accuracy, and uses queue-style KV caching to keep memory bounded over long horizons.

Recent feed-forward models have significantly advanced geometry perception for inferring dense 3D structure from sensor observations. However, the essential capabilities of geometry perception remain fragmented across multiple incompatible paradigms, including online perception, offline reconstruction, multi-modal integration, long-horizon scalability, and metric-scale estimation. We present UniT, a unified model built upon a novel Group Autoregressive Transformer, which reformulates these seemingly disparate capabilities within a single framework. The key idea is to treat groups of sensor observations as the basic autoregressive units and predict the corresponding point maps in an anchor-free and scale-adaptive manner. More specifically, diverse view configurations in both online and offline settings are naturally unified within a single group autoregression process. By varying the group size, online mode operates over multiple autoregressive steps with single-frame groups, whereas offline mode aggregates a multi-frame group in a single forward pass. Meanwhile, a queue-style KV caching mechanism ensures bounded autoregressive memory over long horizons. This is enabled by reducing long-range dependencies on early frames through anchor-free relational modeling, thereby allowing outdated memory to be discarded on the fly. To improve metric-scale generalization across scenes, a scale-adaptive geometry loss is further introduced within this framework. It couples relative geometric constraints with a partial absolute scale term, implicitly regularizing global scale and inducing a progressive transition from scale-invariant geometry to metric-scale solutions. Together with a dedicated modal attention module for integrating auxiliary modalities, UniT achieves state-of-the-art performance in unified geometry perception, as validated on ten benchmarks spanning seven representative tasks.
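
The grouping and caching ideas in the abstract can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the authors' implementation: the function names `make_groups` and `run_group_autoregression`, the parameter `max_cached_groups`, and the use of frame labels in place of real key/value tensors are all hypothetical. It shows how a group size of 1 yields an online-style loop, how a group size equal to the sequence length yields a single offline-style pass, and how a fixed-length queue keeps the cached context bounded.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple


def make_groups(frames: Sequence[str], group_size: int) -> List[List[str]]:
    """Split a frame sequence into consecutive groups (the autoregressive units)."""
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return [list(frames[i:i + group_size]) for i in range(0, len(frames), group_size)]


def run_group_autoregression(
    frames: Sequence[str], group_size: int, max_cached_groups: int
) -> List[Tuple[List[str], List[str]]]:
    """Walk over the groups in order with a FIFO cache of recent groups.

    Returns one (cached_context, current_group) pair per autoregressive step.
    A real model would cache key/value tensors; frame labels stand in for
    them here (an assumption made purely for illustration).
    """
    cache: Deque[List[str]] = deque(maxlen=max_cached_groups)
    steps = []
    for group in make_groups(frames, group_size):
        context = [frame for cached in cache for frame in cached]
        steps.append((context, group))
        cache.append(group)  # once the queue is full, the oldest group is dropped
    return steps


frames = ["f0", "f1", "f2", "f3", "f4"]

# Online-style: single-frame groups, many steps, bounded context.
online = run_group_autoregression(frames, group_size=1, max_cached_groups=2)

# Offline-style: one multi-frame group handled in a single step.
offline = run_group_autoregression(frames, group_size=len(frames), max_cached_groups=2)
```

In the online-style run the context never holds more than two groups, however long the sequence becomes, which is the sense in which a queue bounds memory. Whether discarding old groups is harmless depends on the model not relying on early frames, which the abstract attributes to anchor-free relational modeling.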

Community

Paper submitter about 11 hours ago

🚀 UniT: Unified Geometry Learning with Group Autoregressive Transformer

UniT is a unified feed-forward model for geometry perception.
It reformulates a wide range of geometry perception capabilities into a single framework, covering:

  • Diverse view configurations: supporting both online and offline inference over an arbitrary number of views
  • Flexible modality combinations: incorporating auxiliary inputs such as camera parameters and depth maps
  • Metric-scale perception: recovering geometry in real-world scale, measured in meters
  • Long-horizon scalability: maintaining bounded complexity over long horizons
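
To make the metric-scale point concrete, here is a small, hedged Python sketch of why a purely scale-invariant objective cannot recover meters, and how adding a weighted absolute term changes that. It is not the paper's scale-adaptive geometry loss: the function names, the mean-norm normalization, the L1 distance, and the single weight `abs_weight` are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def mean_norm(points: List[Point]) -> float:
    """Average Euclidean norm of the points, used here as a crude scale estimate."""
    return sum(math.sqrt(x * x + y * y + z * z) for x, y, z in points) / len(points)


def mean_l1(a: List[Point], b: List[Point]) -> float:
    """Mean per-point L1 distance between two equally sized point lists."""
    return sum(abs(p - q) for pa, pb in zip(a, b) for p, q in zip(pa, pb)) / len(a)


def scale_invariant_term(pred: List[Point], target: List[Point], eps: float = 1e-12) -> float:
    """Compare the two point sets after dividing each by its own mean norm."""
    sp = max(mean_norm(pred), eps)
    st = max(mean_norm(target), eps)
    pred_n = [(x / sp, y / sp, z / sp) for x, y, z in pred]
    target_n = [(x / st, y / st, z / st) for x, y, z in target]
    return mean_l1(pred_n, target_n)


def combined_loss(pred: List[Point], target: List[Point], abs_weight: float) -> float:
    """Relative term plus a weighted absolute (metric) term.

    abs_weight is a hypothetical knob: 0 gives a purely scale-invariant loss,
    larger values penalize errors in real-world scale more strongly.
    """
    return scale_invariant_term(pred, target) + abs_weight * mean_l1(pred, target)


target = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 2.0)]
pred = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in target]  # right shape, wrong scale

relative_only = combined_loss(pred, target, abs_weight=0.0)
with_metric = combined_loss(pred, target, abs_weight=0.5)
```

A prediction that is correct up to a global factor of 2 is essentially free under the relative term alone, while the combined loss penalizes it. That gap is what an absolute scale term contributes in this toy setting.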

📄 Paper: https://arxiv.org/abs/2605.21131
🌐 Project Page: https://sc2i-hkustgz.github.io/UniT/
🤗 Hugging Face Demo: https://enceladush-unit.hf.space/


Get this paper in your agent:

hf papers read 2605.21131
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
