Hugging Face Daily Papers · 3 min read

Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

DR-MV3D introduces a map-grounded dense reward framework for multi-view 3D visual question answering, improving cross-view spatial reasoning by supervising global map construction, view-trajectory planning, and egocentric grounding with verifiable process-level rewards.

arxiv:2606.23557 · Project page: https://dr-mv3d.github.io/ · Code: https://github.com/kaist-cvml/DR-MV3D

Published on Jun 22 · Submitted by Jiho Choi on Jun 23
Authors: Jiho Choi, Seonho Lee, Seojeong Park, Hyunjung Shim

Abstract

DR-MV3D presents a map-grounded learning framework with dense rewards to improve multi-view 3D visual question answering through global map construction, view-trajectory planning, and egocentric grounding.

Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.
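
The abstract names two dense rewards and GRPO but gives no formulas, so the following is only a rough sketch of how such a reward could be assembled and turned into group-relative advantages. Everything specific here is an assumption made for illustration, not the authors' implementation: the grid-cell map format, the tolerance-based map score, the longest-common-subsequence trajectory score, the weights, and all function names are hypothetical. Only the final step, normalizing each rollout's reward by the mean and standard deviation of its sampled group, reflects how GRPO is commonly described.

```python
from statistics import mean, pstdev


def map_consistency_reward(pred_map, pseudo_map, tol=1):
    """Assumed form: fraction of pseudo-target objects whose predicted
    grid cell lies within `tol` cells (Chebyshev distance) of the target."""
    if not pseudo_map:
        return 0.0
    hits = 0
    for name, (tx, ty) in pseudo_map.items():
        if name in pred_map:
            px, py = pred_map[name]
            if max(abs(px - tx), abs(py - ty)) <= tol:
                hits += 1
    return hits / len(pseudo_map)


def trajectory_reward(pred_views, target_views):
    """Assumed form: order-aware overlap, computed as the longest common
    subsequence length divided by the target trajectory length."""
    if not target_views:
        return 0.0
    m, n = len(pred_views), len(target_views)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if pred_views[i] == target_views[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / n


def total_reward(rollout, target, w_map=0.3, w_traj=0.3, w_ans=0.4):
    """Weighted sum of the two process rewards and an exact-match answer
    reward. The weights are placeholders, not values from the paper."""
    r_map = map_consistency_reward(rollout["map"], target["map"])
    r_traj = trajectory_reward(rollout["views"], target["views"])
    r_ans = 1.0 if rollout["answer"] == target["answer"] else 0.0
    return w_map * r_map + w_traj * r_traj + w_ans * r_ans


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy example: one question, two sampled rollouts.
target = {"map": {"chair": (2, 3), "table": (5, 5)},
          "views": [0, 2, 1], "answer": "left"}
rollouts = [
    {"map": {"chair": (2, 3), "table": (5, 4)},
     "views": [0, 2, 1], "answer": "left"},
    {"map": {"chair": (0, 0)}, "views": [1, 2, 0], "answer": "right"},
]
rewards = [total_reward(r, target) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

In this toy setup the second rollout gets the answer wrong but still receives partial credit for one correctly ordered view, which is the kind of signal that answer-level supervision alone would not provide.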

Community

Paper submitter about 21 hours ago

DR-MV3D introduces a map-grounded dense reward framework for multi-view 3D visual question answering, improving cross-view spatial reasoning by supervising global map construction, view-trajectory planning, and egocentric grounding with verifiable process-level rewards.
