Hugging Face Daily Papers · 4 min read

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.17539

Published on Jun 16, 2026 · Submitted by yatai ji on Jun 18, 2026

Authors: Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu

Project page: https://sr-real.github.io/
Code: https://github.com/jiyt17/SR-REAL

Abstract

A unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning through reinforcement learning, enabling robust spatial reasoning across diverse tasks and domains.

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
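
The reward design described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the reward formulas, so everything here is an assumption made for illustration: the `<think>`/`<answer>` tag layout, the function names, the distance thresholds, the reward levels, and the plain sum of terms are hypothetical and not taken from SR-REAL.

```python
import math
import re

# Hypothetical output format: reasoning in <think>...</think>,
# final answer in <answer>...</answer>. Not confirmed by the paper.
RESPONSE_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed tag layout, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response: str, target: str) -> float:
    """Return 1.0 if the extracted answer equals the target (case-insensitive)."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == target.strip().lower() else 0.0


def center_detection_reward(pred_center, gt_center,
                            thresholds=(0.25, 0.5, 1.0),
                            levels=(1.0, 0.5, 0.25)) -> float:
    """Stepwise (discrete) reward from the distance between two 3D centers.

    Thresholds (assumed to be in metres) and reward levels are illustrative.
    """
    dist = math.dist(pred_center, gt_center)
    for threshold, level in zip(thresholds, levels):
        if dist <= threshold:
            return level
    return 0.0


def total_reward(response, target, pred_center=None, gt_center=None) -> float:
    """Accuracy plus format reward; DTR samples also get the detection term."""
    reward = accuracy_reward(response, target) + format_reward(response)
    if pred_center is not None and gt_center is not None:
        reward += center_detection_reward(pred_center, gt_center)
    return reward
```

In this sketch an LOR sample would call `total_reward(response, target)` with no centers, while a DTR sample would also pass the predicted and ground-truth centers, so the detection term only affects the path that produces 3D detections.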

Community

Paper submitter about 5 hours ago

We present a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference.
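
As a rough illustration of the "explicit geometric inference" step in DTR, the sketch below answers two spatial queries from 3D centers that are assumed to be already detected. It assumes a camera-centred coordinate frame in metres; the `Detection` class, the region names, and the coordinates are hypothetical and do not come from SR-REAL.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical detection result for one region token."""
    region: str
    center: tuple  # (x, y, z) in metres, camera frame (assumed convention)


def distance_between(a: Detection, b: Detection) -> float:
    """Euclidean distance between two detected 3D centers."""
    return math.dist(a.center, b.center)


def closer_to_camera(a: Detection, b: Detection) -> str:
    """Name of the region whose center is nearer to the camera origin."""
    if math.hypot(*a.center) <= math.hypot(*b.center):
        return a.region
    return b.region


# Made-up detections for two regions.
chair = Detection("Region 1", (0.0, 0.0, 2.0))
table = Detection("Region 2", (3.0, 0.0, 6.0))

print(distance_between(chair, table))   # 5.0
print(closer_to_camera(chair, table))   # Region 1
```

An LOR response would instead try to reach the same answers through step-by-step linguistic deduction, without materializing coordinates.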


Get this paper in your agent:

hf papers read 2606.17539
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
