Hugging Face Daily Papers · 4 min read

SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

The paper can also be found at https://github.com/Hongcheng-Gao/SpatialWorld/blob/main/SpatialWorld.pdf

Project page: https://spatial-world.github.io
Code: https://github.com/Hongcheng-Gao/SpatialWorld
arxiv:2606.09669

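Published on Jun 8 · Submitted by Hongcheng Gao on Jun 9

Authors: Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong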

Abstract

SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions.

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
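
To make the evaluation setup in the abstract concrete, here is a minimal sketch of how a task success rate (TSR) could be computed from tasks that each carry an initial state, a reference trajectory, and a terminal-state verifier. This is not the SpatialWorld code or API: the `Task` class, `run_episode`, `task_success_rate`, the dictionary state, the `goto`/`stop` action strings, and the step limit are all hypothetical, and the toy policy sees the full state rather than egocentric images.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy state: a flat dict of named facts (not the benchmark's format).
State = Dict[str, str]


@dataclass
class Task:
    """Illustrative task record mirroring the three parts named in the abstract."""
    initial_state: State
    reference_trajectory: List[str]      # text actions of a known-good solution
    verifier: Callable[[State], bool]    # checks only the terminal state


def run_episode(task: Task,
                policy: Callable[[State], str],
                step_fn: Callable[[State, str], State],
                max_steps: int = 20) -> bool:
    """Roll out a text-action policy, then judge the final state with the verifier."""
    state = dict(task.initial_state)     # copy so the task stays reusable
    for _ in range(max_steps):
        action = policy(state)
        if action == "stop":             # assumed convention for ending an episode
            break
        state = step_fn(state, action)
    return task.verifier(state)


def task_success_rate(tasks: List[Task],
                      policy: Callable[[State], str],
                      step_fn: Callable[[State, str], State]) -> float:
    """TSR = number of tasks whose verifier passes / total number of tasks."""
    if not tasks:
        return 0.0
    successes = sum(run_episode(t, policy, step_fn) for t in tasks)
    return successes / len(tasks)


def toy_step(state: State, action: str) -> State:
    """Toy transition: 'goto <room>' moves the agent; anything else is a no-op."""
    new_state = dict(state)
    if action.startswith("goto "):
        new_state["agent"] = action[len("goto "):]
    return new_state


def kitchen_policy(state: State) -> str:
    """Toy policy that always walks to the kitchen and then stops."""
    return "stop" if state["agent"] == "kitchen" else "goto kitchen"


demo_tasks = [
    Task({"agent": "hall"}, ["goto kitchen"], lambda s: s["agent"] == "kitchen"),
    Task({"agent": "hall"}, ["goto garden"], lambda s: s["agent"] == "garden"),
]

demo_tsr = task_success_rate(demo_tasks, kitchen_policy, toy_step)
print(f"TSR = {demo_tsr:.1%}")
```

In this toy run the fixed policy solves the kitchen task but not the garden task, so one of two verifiers passes. Judging only the terminal state, as the abstract describes, means an agent may take a different route than the reference trajectory and still count as successful.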


Get this paper in your agent:

hf papers read 2606.09669
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
