Hugging Face Daily Papers · June 1, 2026 · 3 min read

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2605.31096


Published on May 29 · Submitted by Allen Zhang on Jun 1 · University of Hong Kong


Authors: Chang-Bin Zhang, Yujie Zhong, Qiang Zhang, Kai Han

AI-generated summary

A reinforcement learning framework called iVGR is introduced to transfer visual localization capabilities into textual reasoning, improving fine-grained perception in multimodal language models without requiring explicit visual grounding during inference.

Abstract

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.
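The abstract does not say how the consistency reward is computed, so the following is only an illustrative sketch of one way a textual rollout could be scored against grounded rollouts. Everything in it is an assumption made for illustration and is not taken from the paper: the `Rollout` type, the use of a reference box with an IoU threshold to decide which grounded rollouts count as "high-quality", and the answer-agreement rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Rollout:
    """Hypothetical container for one sampled response."""
    answer: str
    box: Optional[Box] = None  # only grounded-stream rollouts carry a box


def consistency_reward(
    textual: Rollout,
    grounded: List[Rollout],
    ref_box: Box,
    iou_thresh: float = 0.5,  # assumed quality threshold, not from the paper
) -> float:
    """Fraction of high-quality grounded rollouts whose answer matches
    the textual rollout's answer (an assumed form of 'consistency')."""
    # Keep grounded rollouts whose box overlaps the reference box enough.
    good = [
        g for g in grounded
        if g.box is not None and iou(g.box, ref_box) >= iou_thresh
    ]
    if not good:
        return 0.0  # nothing reliable to align with
    agree = sum(1 for g in good if g.answer == textual.answer)
    return agree / len(good)
```

In this sketch the textual rollout never emits a box; it is rewarded only for agreeing with grounded rollouts that localized well, which is one possible reading of "aligned with a high-quality visually grounded stream". The actual reward in iVGR may differ substantially.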

arXiv: arxiv.org/abs/2605.31096 · Project page: https://visual-ai.github.io/ivgr/ · GitHub: https://github.com/Visual-AI/iVGR

Community

allencbzhang

Paper submitter about 3 hours ago

A new paradigm for "Thinking with primitives."


Get this paper in your agent:

hf papers read 2605.31096

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
