Hugging Face Daily Papers · June 4, 2026 · 4 min read

Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

ReasonMatch turns wide-baseline matching into a verifiable RL task for MLLMs. An 8B model trained with DCRL reaches 70.5 F1 and beats GPT-5-mini on ReasonMatch-Bench, which is encouraging evidence that geometric supervision plus RL can unlock spatial reasoning without CoT labels.

Project page: https://aim-uofa.github.io/reasonmatch/
Code: https://github.com/aim-uofa/ReasonMatch

arxiv:2606.03577

Published on Jun 2 · Submitted by zhumuzhi on Jun 4 · Zhejiang University

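Authors: Hao Zhong, Muzhi Zhu, Shenyan Zeng, Anzhou Li, Cong Chen, Hua Geng, Duochao Shi, Wentao Ye, Tao Lin, Hao Chen, Chunhua Shen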

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): Wide-baseline matching is a challenging spatial reasoning testbed for multimodal large language models, which currently lack systematic evaluation and training frameworks for it. The paper introduces ReasonMatch-Bench and Dynamic Correspondence Reinforcement Learning to improve performance.

Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs lack systematic evaluation and training frameworks for these capabilities. We introduce ReasonMatch-Bench, a benchmark stratified by viewpoint displacement and matching granularity across indoor, outdoor, and object-centric scenarios, and show that current MLLMs still struggle with fine-grained wide-baseline correspondence: on a difficult 90-sample subset, human annotators achieve 84.0 F1, while the best existing baseline reaches 37.2. To bridge this gap, we build a scalable data-generation pipeline that automatically extracts wide-baseline view pairs from large-scale video-3D corpora, including RGB-D videos and SfM reconstructions, yielding diverse and verifiable supervision. We further propose Dynamic Correspondence Reinforcement Learning (DCRL), which combines Image-Level Viewpoint Progression and Point-Level Correspondence Curriculum to improve WBM training through verifiable rewards without explicit CoT supervision. Extensive experiments show that DCRL substantially improves performance on ReasonMatch-Bench and transfers to related spatial benchmarks, while maintaining general visual understanding performance with modest gains on several benchmarks.
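
The abstract reports results as F1 and describes training "through verifiable rewards", but it does not spell out how either is computed. As a rough illustration only, the sketch below shows one way a point-correspondence answer could be scored against ground truth: a predicted point counts as correct if it lands within a pixel threshold of the reference point, and precision and recall are combined into F1. The function name `correspondence_f1`, the dictionary layout keyed by query id, and the 10-pixel threshold are all assumptions made for this example, not details taken from the paper or its code.

```python
from math import hypot
from typing import Dict, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates in the target image


def correspondence_f1(
    predicted: Dict[str, Point],
    ground_truth: Dict[str, Point],
    pixel_threshold: float = 10.0,  # assumed tolerance, not from the paper
) -> float:
    """Score predicted target-image points against ground truth with F1.

    Both dictionaries map a query-point id to its location in the target
    image. Query points with no visible match (e.g. occluded) are simply
    absent from `ground_truth`, so predicting them lowers precision.
    """
    if not predicted or not ground_truth:
        return 0.0

    true_positives = 0
    for query_id, (px, py) in predicted.items():
        if query_id not in ground_truth:
            continue  # predicted a match where none exists
        gx, gy = ground_truth[query_id]
        if hypot(px - gx, py - gy) <= pixel_threshold:
            true_positives += 1

    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: one of two predictions is within tolerance,
# and one of three ground-truth points is recovered.
example_truth = {"a": (100.0, 100.0), "b": (200.0, 50.0), "c": (30.0, 40.0)}
example_pred = {"a": (103.0, 104.0), "b": (260.0, 50.0)}
print(round(correspondence_f1(example_pred, example_truth), 3))  # 0.4
```

Because the score depends only on the predicted coordinates and the geometry-derived reference points, it can be checked automatically, which is the property that makes this kind of task usable as an RL reward without chain-of-thought labels. The actual reward used by DCRL may differ.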