Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning
Authors: Chaofan Ma, Zhenjie Mao, Yuhuan Yang, Fanqin Zeng, Yue Shi, Yingjie Zhou, Xiaofeng Cao, Jiangchao Yao
Organization: Shanghai Jiao Tong University (SJTU)
Published: 2026-06-10
arXiv: 2606.11683
Abstract
ReRe is a training-free framework for spatial reasoning from egocentric videos: the model revisits its initial conclusions using novel-view videos synthesized from predicted 3D geometry.
Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/
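The two-phase flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ask` and `render_novel_view` callables, the `ReReResult` type, and the prompt wording are all hypothetical stand-ins, and the sketch assumes the Re-reason phase is shown only the novel-view video together with the initial answer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical interfaces; none of these names come from the paper.
Video = Sequence[str]                   # e.g. a list of frame paths
AskFn = Callable[[Video, str], str]     # (video, prompt) -> answer text
RenderFn = Callable[[Video], Video]     # original video -> novel-view video


@dataclass
class ReReResult:
    hypothesis: str  # answer from the Reason phase
    final: str       # answer after the Re-reason phase
    revised: bool    # True if the second phase changed the answer


def rere(video: Video, question: str, ask: AskFn,
         render_novel_view: RenderFn) -> ReReResult:
    # Reason phase: form a hypothesis from the original egocentric video.
    hypothesis = ask(video, question).strip()

    # Stand-in for the Geometry-to-Video pipeline (assumed to be external).
    novel_view = render_novel_view(video)

    # Re-reason phase: verify or revise the hypothesis under the new view.
    prompt = (
        f"Question: {question}\n"
        f"Initial answer from the original video: {hypothesis}\n"
        "This video shows the same scene from an elevated, oblique viewpoint. "
        "Verify the initial answer or revise it. Reply with the final answer only."
    )
    final = ask(novel_view, prompt).strip()
    return ReReResult(hypothesis, final, final != hypothesis)


# Stub model and renderer so the sketch runs without any real MLLM.
def fake_ask(video: Video, prompt: str) -> str:
    return "3" if "Initial answer" in prompt else "2"


def fake_render(video: Video) -> Video:
    return ["novel_000.png"]


result = rere(["frame_000.png"], "How many chairs are in the room?",
              fake_ask, fake_render)
print(result)
```

Because both phases go through the same `ask` callable, the model's video interface is unchanged between them, which mirrors the abstract's point that no architectural modification is needed.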
Community
Spatial reasoning should be revisitable. Given an egocentric video, an MLLM commits to plausible but wrong answers when the camera trajectory leaves key evidence occluded. The proposed ReRe forms an initial hypothesis (Reason), then revisits it under a synthesized novel view (Re-reason) that exposes the complementary geometry, flipping wrong answers to right, as shown below for object counting and route planning.
[Image: https://cdn-uploads.huggingface.co/production/uploads/63048965eb6d777a838cb7a8/E5kvb_kmlO1_7Vhsf3w68.png]
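One way to picture an elevated, oblique viewpoint is to place a camera on a sphere around the scene and aim it at the scene centre. The sketch below is only an illustration under stated assumptions: the page does not say how ReRe chooses its views, so the function name `oblique_camera`, the z-up convention, the default angles, and the `margin` factor are all hypothetical.

```python
import math


def oblique_camera(points, elevation_deg=45.0, azimuth_deg=30.0, margin=1.5):
    """Return (eye, center, forward) for a camera looking down at a point cloud.

    Assumes a z-up world frame. All parameters are illustrative choices,
    not values taken from the paper.
    """
    xs, ys, zs = zip(*points)
    lo = (min(xs), min(ys), min(zs))
    hi = (max(xs), max(ys), max(zs))
    # Centre of the axis-aligned bounding box of the scene.
    center = tuple((a + b) / 2 for a, b in zip(lo, hi))

    # Half the bounding-box diagonal: radius of a sphere enclosing the scene.
    radius = max(0.5 * math.dist(lo, hi), 1e-6)
    distance = margin * radius

    el = math.radians(elevation_deg)
    az = math.radians(azimuth_deg)
    # Spherical-to-Cartesian offset from the scene centre.
    eye = (
        center[0] + distance * math.cos(el) * math.cos(az),
        center[1] + distance * math.cos(el) * math.sin(az),
        center[2] + distance * math.sin(el),
    )
    # Unit vector from the camera towards the scene centre.
    forward = tuple((c - e) / distance for c, e in zip(center, eye))
    return eye, center, forward


eye, center, forward = oblique_camera([(0, 0, 0), (4, 4, 2)])
print(eye, center, forward)
```

Whether the whole scene actually fits in frame depends on the renderer's field of view, so `margin` would need tuning in practice.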
