Hugging Face Daily Papers · 4 min read

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://ropedia.github.io/S-Agent
Code: https://github.com/Ropedia/S-Agent
arxiv:2606.20515

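Published on Jun 18 · Submitted by leoli on Jun 19 · #3 Paper of the day
Authors: Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu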

Abstract

S-Agent is a spatial reasoning framework that enhances visual language models with temporal memory and hierarchical spatial tools to enable continuous 3D world understanding from multi-view imagery.

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce S-Agent, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, S-Agent reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, S-Agent casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (e.g., counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that S-Agent consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on S-Agent-generated spatial trajectories (S-300K) yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).
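
To make the contrast between frame-level prediction and scene-centric evidence accumulation concrete, here is a minimal Python sketch. It is not the authors' implementation: every name in it (Detection2D, Frame, SceneMemory, lift_to_3d, accumulate), the pinhole intrinsics, the identity camera rotation, and the 0.3 m merge radius are assumptions chosen for illustration. It lifts 2D detections with depth into world coordinates and merges repeated sightings of the same object in a toy scene memory, so that a count reflects the scene rather than the sum of per-frame detections.

```python
from dataclasses import dataclass, field
import math


@dataclass
class Detection2D:
    label: str
    u: float      # pixel column of the box centre
    v: float      # pixel row of the box centre
    depth: float  # metres along the optical axis (assumed given by a depth tool)


@dataclass
class Frame:
    detections: list
    cam_position: tuple  # camera position in the world frame; rotation assumed identity


def lift_to_3d(det, cam_position, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Back-project a pixel with depth to world coordinates (pinhole model)."""
    x = (det.u - cx) * det.depth / fx
    y = (det.v - cy) * det.depth / fy
    z = det.depth
    return (x + cam_position[0], y + cam_position[1], z + cam_position[2])


@dataclass
class SceneMemory:
    """Toy stand-in for a scene memory: one entry per distinct object."""
    merge_radius: float = 0.3  # hypothetical threshold, in metres
    objects: list = field(default_factory=list)  # (label, (x, y, z)) pairs

    def add(self, label, point):
        # Treat a same-label sighting within merge_radius as the same object.
        for known_label, known_point in self.objects:
            if known_label == label and math.dist(known_point, point) < self.merge_radius:
                return False
        self.objects.append((label, point))
        return True

    def count(self, label):
        return sum(1 for known_label, _ in self.objects if known_label == label)


def accumulate(frames):
    """Fold every frame's lifted detections into a single scene memory."""
    memory = SceneMemory()
    for frame in frames:
        for det in frame.detections:
            memory.add(det.label, lift_to_3d(det, frame.cam_position))
    return memory


# Two views of the same room: the first chair is visible in both frames.
frames = [
    Frame([Detection2D("chair", 320.0, 240.0, 2.0)], (0.0, 0.0, 0.0)),
    Frame([Detection2D("chair", 70.0, 240.0, 2.0),
           Detection2D("chair", 445.0, 240.0, 2.0)], (1.0, 0.0, 0.0)),
]
memory = accumulate(frames)
frame_level_total = sum(len(f.detections) for f in frames)
print(frame_level_total)      # → 3 (per-frame detections double-count one chair)
print(memory.count("chair"))  # → 2 (distinct chairs after 3D merging)
```

A distance-based merge is only one possible way to associate sightings across frames; the abstract does not say how Scene Memory resolves object identity, so that choice should be read as a placeholder.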

Community

Paper submitter about 4 hours ago

The first agentic model for Spatial Intelligence.
S-Agent turns perception into action: grounding, reconstructing, and reasoning with tools to solve complex spatial tasks step by step.


Get this paper in your agent:

hf papers read 2606.20515
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
