Hugging Face Daily Papers · June 2, 2026 · 4 min read

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2605.28132


Published on May 27 · Submitted by Tony Zhao on Jun 2 · Om AI Lab


Authors: Haozhan Shen, Tiancheng Zhao, Kangjia Zhao, Jianwei Yin

AI-generated summary

A systematic comparison of vision-language models and video generation models reveals complementary strengths for spatial intelligence tasks, with vision-language models excelling in semantic tagging and instance grouping while video generation models perform better in dense geometry and camera motion prediction.

Abstract

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. By training only a lightweight probe, our framework enables a controlled comparison of what information is already encoded in the frozen representations of the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.
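
To make the idea of frozen-feature probing and naive fusion concrete, here is a minimal sketch on synthetic data. It is not the authors' code or probe design: the two "backbones" are hypothetical fixed random projections, the targets are toy regression targets rather than the paper's tagging, grouping, or geometry tasks, the ridge-regression probe is a stand-in chosen for brevity, and "fusion" is assumed to mean plain feature concatenation. All names (`fit_ridge_probe`, `probe_mse`, `frozen`, `errors`) are illustrative.

```python
import numpy as np


def add_bias(features):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_ridge_probe(features, targets, l2=1e-3):
    """Closed-form ridge regression; these weights are the only trained parameters."""
    x = add_bias(features)
    gram = x.T @ x + l2 * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ targets)


def probe_mse(weights, features, targets):
    """Mean squared error of the probe's predictions on held-out features."""
    predictions = add_bias(features) @ weights
    return float(np.mean((predictions - targets) ** 2))


rng = np.random.default_rng(0)
n_train, n_total = 400, 600

# Hypothetical latent factors of a scene: "semantic" and "geometric".
semantic = rng.normal(size=(n_total, 4))
geometric = rng.normal(size=(n_total, 4))

# Stand-ins for frozen backbones: fixed projections that are never updated.
# Backbone A only encodes the semantic factors, backbone B only the geometric ones.
frozen = {
    "A": semantic @ rng.normal(size=(4, 16)),
    "B": geometric @ rng.normal(size=(4, 16)),
}
# Naive fusion, assumed here to be simple concatenation along the feature axis.
frozen["fused"] = np.hstack([frozen["A"], frozen["B"]])

# Toy probing targets, each depending on one kind of latent factor.
targets = {
    "semantic": semantic @ rng.normal(size=(4, 1)),
    "geometric": geometric @ rng.normal(size=(4, 1)),
}

# Fit one probe per (feature set, task) pair and score it on held-out samples.
errors = {}
for name, feats in frozen.items():
    errors[name] = {}
    for task, y in targets.items():
        w = fit_ridge_probe(feats[:n_train], y[:n_train])
        errors[name][task] = probe_mse(w, feats[n_train:], y[n_train:])

for name, per_task in errors.items():
    print(name, {task: round(err, 4) for task, err in per_task.items()})
```

Because the backbones stay fixed and only the small probe is fit, a low held-out error indicates that the information is already accessible in the features. In this toy setup each backbone does well only on the task that matches its latent factors, while the concatenated features do well on both, mirroring the complementarity described in the abstract.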


Community

tianchez · Paper submitter · about 8 hours ago

As we push toward Vision-First architectures for robotics, a critical question remains: Which pre-training scheme provides the best substrate for Spatial Intelligence? VLM or VGM?

To find out, we built a lightweight, frozen-feature probing framework to evaluate both model families across three axes of physical understanding.


Get this paper in your agent:

hf papers read 2605.28132

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
