Hugging Face Daily Papers · June 23, 2026 · 3 min read

Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

DR-MV3D introduces a map-grounded dense reward framework for multi-view 3D visual question answering, improving cross-view spatial reasoning by supervising global map construction, view-trajectory planning, and egocentric grounding with verifiable process-level rewards.

arxiv:2606.23557

Published on Jun 22

Submitted by Jiho Choi on Jun 23

KAIST AI

Authors: Jiho Choi, Seonho Lee, Seojeong Park, Hyunjung Shim

Abstract

DR-MV3D presents a map-grounded learning framework with dense rewards to improve multi-view 3D visual question answering through global map construction, view-trajectory planning, and egocentric grounding.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
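The abstract names two process-level rewards and GRPO but does not give their formulas, so the following is only a minimal sketch of how such a setup could be wired together, not the authors' implementation. The `Rollout` schema, the IoU-style map score, the position-match trajectory score, and the weights `w_map` and `w_traj` are all assumptions made for illustration. The only part taken from GRPO as commonly described is the group-relative normalization, which standardizes each reward against the other rollouts sampled for the same question.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled reasoning trajectory for a question (hypothetical schema)."""
    predicted_cells: set[tuple[int, int]]  # occupied cells of the predicted top-down map
    view_order: list[int]                  # indices of the selected views, in order
    answer: str                            # final answer string


def global_consistency_reward(pred: set, target: set) -> float:
    """Assumed metric: IoU between predicted map cells and pseudo-target cells."""
    if not pred and not target:
        return 1.0
    return len(pred & target) / len(pred | target)


def local_trajectory_reward(pred_order: list[int], ref_order: list[int]) -> float:
    """Assumed metric: fraction of reference positions matched by the selected view."""
    if not ref_order:
        return 0.0
    hits = sum(p == r for p, r in zip(pred_order, ref_order))
    return hits / len(ref_order)


def answer_reward(pred: str, gold: str) -> float:
    """Sparse answer-level reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def total_reward(rollout: Rollout, target_cells: set, ref_order: list[int],
                 gold: str, w_map: float = 0.5, w_traj: float = 0.5) -> float:
    """Answer reward plus weighted dense rewards (the weights are illustrative)."""
    return (answer_reward(rollout.answer, gold)
            + w_map * global_consistency_reward(rollout.predicted_cells, target_cells)
            + w_traj * local_trajectory_reward(rollout.view_order, ref_order))


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Toy pseudo targets; in the paper these come from frozen 3D foundation models.
    target_cells = {(0, 0), (0, 1), (1, 1)}
    ref_order = [2, 0, 1]
    group = [
        Rollout({(0, 0), (0, 1), (1, 1)}, [2, 0, 1], "left"),
        Rollout({(0, 0), (1, 0)}, [0, 2, 1], "left"),
        Rollout({(2, 2)}, [1, 0, 2], "right"),
    ]
    rewards = [total_reward(r, target_cells, ref_order, gold="left") for r in group]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.3f} advantage={adv:+.3f}")
```

In this toy setup the two rollouts that both answer correctly still receive different advantages, because the dense terms separate a well-grounded map and view order from a poorly grounded one. That is the kind of signal answer-level supervision alone cannot provide.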

Project page: https://dr-mv3d.github.io/
GitHub: https://github.com/kaist-cvml/DR-MV3D
