Hugging Face Daily Papers · June 8, 2026 · 4 min read

SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2605.23898


Published on May 22

Submitted by Jianshu Zhang on Jun 8


Authors: Jianshu Zhang, Yijiang Li, Huifeixin Chen, Haoran Lu, Letian Xue, Bingyang Wang, Han Liu

Abstract

Vision-language models struggle to genuinely understand spatial numerical concepts, relying on shallow visual cues rather than developing robust coordinate-aware representations.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they need to produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. Therefore, in this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations. We systematically study whether current VLMs truly understand numerical values in spatial settings. Across dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.
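The abstract reports that models often perform close to random guessing. As a rough illustration only, and assuming a multiple-choice evaluation format (the abstract does not state the task format), a chance baseline and an above-chance check could be sketched as below. The function names, the significance threshold, and the example counts are hypothetical and are not taken from the paper.

```python
from math import comb


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing over num_options choices."""
    if num_options < 2:
        raise ValueError("need at least two answer options")
    return 1.0 / num_options


def binomial_tail(correct: int, total: int, p: float) -> float:
    """Exact one-sided probability P(X >= correct) for X ~ Binomial(total, p)."""
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(correct, total + 1)
    )


def above_chance(correct: int, total: int, num_options: int, alpha: float = 0.05) -> bool:
    """True if the observed score is unlikely under pure guessing (assumed threshold alpha)."""
    return binomial_tail(correct, total, chance_accuracy(num_options)) < alpha


if __name__ == "__main__":
    # Hypothetical counts: 27 of 100 correct on a 4-option task.
    print(chance_accuracy(4))
    print(above_chance(27, 100, 4))
```

Under these assumptions, a score only slightly above `chance_accuracy` would not pass `above_chance`, which is one way to read "close to random guessing" quantitatively.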

Project page: https://sterzhang.github.io/SpaceNum-Home/

Community

Sterzhang

Paper submitter about 5 hours ago

🔢 Embodied AI cannot avoid numbers.

From angles to distances and coordinates, numbers are everywhere in perception and action.

But are current VLMs ready for that?

In SᴘᴀᴄᴇNᴜᴍ, we revisit this spatial numerical understanding capability!


