SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
Authors: Hongcheng Gao, Hailong Qu, Jingyi Tang, Jiahao Wang, Zihao Huang, Hengkang Qiao, Shihong Huang, Junming Yang, Yi Li, Hongyixuan Yuan, Wenjie Li, Bohan Zeng, Wenbo Li, Bo Wang, Jianhui Liu, Olive Huang, Haoyang Huang, Wentao Zhang, Guoqing Huang, Nan Duan, Yinpeng Dong
Organization: Tsinghua University
Published: 2026-06-08
arXiv: 2606.09669
Project page: https://spatial-world.github.io
Code: https://github.com/Hongcheng-Gao/SpatialWorld
The paper can also be found at https://github.com/Hongcheng-Gao/SpatialWorld/blob/main/SpatialWorld.pdf
Abstract
SpatialWorld presents a unified benchmark for evaluating interactive spatial understanding in multimodal agents through diverse real-world tasks with partial observability and text-based actions.
Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks. Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier. Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.
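The abstract describes an evaluation loop in which an agent emits text actions, a terminal-state verifier judges the final state, and results are reported as a task success rate (TSR). The following is a minimal sketch of that idea in a toy 1-D world, not SpatialWorld's actual code. Every name in it (`Task`, `step`, `run_episode`, `task_success_rate`, the action strings, the step budget) is a hypothetical chosen for illustration. Unlike the benchmark, the toy agent reads the state directly instead of gathering egocentric visual observations, and TSR is taken here as a plain fraction of tasks, which may differ from how the paper averages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical and for illustration only;
# they are not SpatialWorld's API.
State = Dict[str, int]


@dataclass
class Task:
    initial_state: State               # stands in for a human-validated start state
    verifier: Callable[[State], bool]  # terminal-state check
    max_steps: int = 10                # assumed step budget


def step(state: State, action: str) -> State:
    """Toy 1-D world: text actions move the agent along a line."""
    new_state = dict(state)
    if action == "move forward":
        new_state["pos"] += 1
    elif action == "move backward":
        new_state["pos"] -= 1
    return new_state  # unrecognized actions leave the state unchanged


def run_episode(task: Task, agent: Callable[[State], str]) -> bool:
    """Roll out the agent, then judge only the final state."""
    state = dict(task.initial_state)
    for _ in range(task.max_steps):
        action = agent(state)
        if action == "stop":
            break
        state = step(state, action)
    return task.verifier(state)


def task_success_rate(tasks: List[Task], agent: Callable[[State], str]) -> float:
    """Fraction of tasks whose terminal state passes the verifier."""
    if not tasks:
        return 0.0
    return sum(run_episode(t, agent) for t in tasks) / len(tasks)


def forward_agent(state: State) -> str:
    # Trivial policy: walk forward until position 3, then stop.
    return "move forward" if state["pos"] < 3 else "stop"


tasks = [
    Task({"pos": 0}, lambda s: s["pos"] == 3),               # reachable goal
    Task({"pos": 0}, lambda s: s["pos"] == -1),              # policy walks the wrong way
    Task({"pos": 5}, lambda s: s["pos"] == 5),               # already satisfied
    Task({"pos": 0}, lambda s: s["pos"] == 3, max_steps=2),  # step budget too small
]
tsr = task_success_rate(tasks, forward_agent)
print(f"TSR = {tsr:.1%}")
```

Judging only the terminal state keeps the success check independent of the path taken, which is why a separate reference trajectory is useful for measuring execution efficiency.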