Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
Authors: Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen
Published: 2026-06-02
arXiv: 2606.03577
Project page: https://aim-uofa.github.io/reasonmatch/
Code: https://github.com/aim-uofa/ReasonMatch
Abstract
Wide-baseline matching is a challenging spatial-reasoning testbed for multimodal large language models, which currently lack systematic evaluation and training frameworks for it. ReasonMatch-Bench and Dynamic Correspondence Reinforcement Learning are introduced to fill that gap and improve performance.
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and structure-from-motion (SfM) reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit chain-of-thought (CoT) supervision. Extensive experiments show that DCRL substantially improves performance on ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.
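The abstract reports F1 scores and "verifiable rewards" but does not spell out how either is computed. As a rough illustration only, here is one way an F1 score over point correspondences could be computed. The function name `correspondence_f1`, the pixel tolerance `tol_px`, and the convention that `None` marks a point with no visible match are all assumptions made for this sketch, not details taken from the paper or from ReasonMatch-Bench.

```python
from math import hypot
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def correspondence_f1(
    pred: Dict[str, Optional[Point]],
    gt: Dict[str, Optional[Point]],
    tol_px: float = 10.0,  # hypothetical tolerance, not from the paper
) -> float:
    """F1 of predicted target-image locations against ground truth.

    Keys identify query points in the source image. A value of None
    means "no visible match" (e.g. the point is occluded).
    """
    # Only non-None entries count as predictions / ground-truth matches.
    predicted = {k: p for k, p in pred.items() if p is not None}
    visible = {k: p for k, p in gt.items() if p is not None}

    # A prediction is a true positive if the point is truly visible and
    # the predicted location lies within tol_px of the ground truth.
    true_pos = sum(
        1
        for k, p in predicted.items()
        if k in visible
        and hypot(p[0] - visible[k][0], p[1] - visible[k][1]) <= tol_px
    )

    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(visible) if visible else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gt = {"a": (100.0, 100.0), "b": (200.0, 50.0), "c": None}
    pred = {"a": (103.0, 104.0), "b": (300.0, 50.0), "c": (10.0, 10.0)}
    print(f"F1 = {correspondence_f1(pred, gt):.3f}")
```

Because a score like this is computed directly from geometric ground truth, it could in principle double as a scalar reward in an RL loop without any annotated reasoning traces. Whether DCRL uses this exact form is not stated in the abstract.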
Community
ReasonMatch turns wide-baseline matching into a verifiable RL task for MLLMs. An 8B model trained with DCRL hits 70.5 F1 and beats GPT-5-mini on ReasonMatch-Bench—nice evidence that geometric supervision + RL can unlock spatial reasoning without CoT labels.