Hugging Face Daily Papers · June 18, 2026 · 4 min read

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.17539


Published on Jun 16 · Submitted by yatai ji on Jun 18 · University of Hong Kong


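Authors: Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu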

Abstract

A unified framework for spatial vision-language models that combines linguistic deduction and 3D geometric reasoning through reinforcement learning, enabling robust spatial reasoning across diverse tasks and domains.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.
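The reward structure described in the abstract (accuracy and format rewards for both paths, plus a discrete center-based detection reward for DTR) can be illustrated with a minimal sketch. This is not the authors' implementation: the `<think>`/`<answer>` output tags, the exact-match accuracy check, the distance thresholds, and the unweighted sum are all assumptions made here for illustration, since the abstract does not specify them.

```python
import math
import re
from typing import Optional, Sequence

# ASSUMPTION: responses look like "<think>...</think><answer>...</answer>".
# The abstract mentions a format reward but does not define the format.
FORMAT_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL
)


def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(response.strip()) else 0.0


def accuracy_reward(response: str, target: str) -> float:
    """1.0 if the extracted answer matches the target (case-insensitive)."""
    match = FORMAT_RE.match(response.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip().lower() == target.strip().lower() else 0.0


def center_detection_reward(
    pred_center: Sequence[float],
    gt_center: Sequence[float],
    thresholds: Sequence[float] = (0.25, 0.5, 1.0),  # hypothetical, in meters
) -> float:
    """Discrete reward: fraction of distance thresholds the prediction satisfies."""
    dist = math.dist(pred_center, gt_center)
    hits = sum(dist <= t for t in thresholds)
    return hits / len(thresholds)


def total_reward(
    response: str,
    target: str,
    pred_center: Optional[Sequence[float]] = None,
    gt_center: Optional[Sequence[float]] = None,
) -> float:
    """Accuracy + format for every sample; add the detection term for DTR only."""
    reward = accuracy_reward(response, target) + format_reward(response)
    if pred_center is not None and gt_center is not None:
        reward += center_detection_reward(pred_center, gt_center)
    return reward
```

In this sketch an LOR sample is scored by calling `total_reward(response, target)` alone, while a DTR sample also passes a predicted and a ground-truth 3D center. The stepped thresholds make the detection term discrete rather than a continuous function of distance, which is one plausible reading of "discrete center-based detection reward".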

Project page: https://sr-real.github.io/ · GitHub: https://github.com/jiyt17/SR-REAL


