Hugging Face Daily Papers · 5 min read

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.13673

Published on Jun 11 · Submitted by Min-Hung Chen on Jun 12 · #3 Paper of the day
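Authors: Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen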

Abstract

SpatialClaw is a training-free framework that uses code as an action interface to enable flexible, stateful spatial reasoning in vision-language models, achieving superior performance across diverse 3D/4D spatial reasoning tasks.

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
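
To make the "one executable cell per step" idea concrete, here is a minimal sketch of a stateful kernel loop. It is not the SpatialClaw implementation: `StatefulKernel`, `run_agent`, `scripted_policy`, and the `detect_boxes` primitive are hypothetical names invented for illustration, the perception output is canned data, and a fixed script stands in for the VLM. The sketch only shows the interface property the abstract describes, namely that each cell runs in a shared namespace and its printed output is fed back before the next cell is written.

```python
import contextlib
import io
import traceback


class StatefulKernel:
    """Toy persistent kernel: variables defined in one cell survive into later cells."""

    def __init__(self, preloaded=None):
        # One shared namespace plays the role of the pre-loaded kernel state.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return its captured stdout (or traceback) as text."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            try:
                exec(source, self.namespace)
            except Exception:
                # Errors come back as observations so the next cell can react to them.
                buffer.write(traceback.format_exc())
        return buffer.getvalue()


def detect_boxes(frame):
    """Hypothetical perception primitive: returns canned 3D object centers for a frame."""
    return frame["boxes"]


def scripted_policy(history):
    """Stand-in for the VLM: picks the next cell given all prior (cell, output) pairs."""
    cells = [
        "boxes = detect_boxes(frames[0])\nprint(sorted(boxes))",
        "dx = boxes['chair'][0] - boxes['table'][0]\n"
        "print('chair is', 'left of' if dx < 0 else 'right of', 'table')",
    ]
    return cells[len(history)] if len(history) < len(cells) else None


def run_agent(kernel, policy, max_steps=8):
    """Alternate between asking the policy for a cell and executing it in the kernel."""
    history = []
    for _ in range(max_steps):
        cell = policy(history)
        if cell is None:
            break
        history.append((cell, kernel.run_cell(cell)))
    return history


# Assumed toy input: one frame with made-up (x, y, z) object centers.
frames = [{"boxes": {"chair": (-1.0, 0.0, 2.0), "table": (0.5, 0.0, 2.5)}}]
kernel = StatefulKernel({"frames": frames, "detect_boxes": detect_boxes})
history = run_agent(kernel, scripted_policy)
for cell, output in history:
    print(output, end="")
```

The second cell reuses `boxes` from the first without recomputing it, which is what a single-pass script cannot do conditionally: there the whole analysis would have to be written before any output was seen. A real agent would replace `scripted_policy` with a model call that reads `history`, including any traceback text.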

Community

Min-Hung Chen (paper author, paper submitter):

"Code is the right action interface for spatial reasoning!!"

SpatialClaw lets a VLM-backed agent write Python in a persistent kernel, composing perception modules, inspecting intermediate results, and revising its strategy across steps.

It is training-free, with no benchmark- or model-specific adaptation, yet it beats a recent prior agent by +11.2 points on 20 benchmarks and improves consistently across six VLM backbones.


Get this paper in your agent:

hf papers read 2606.13673
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
