Hugging Face Daily Papers · June 18, 2026 · 5 min read

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2606.19120

Published on Jun 17 · Submitted by OedoSoldier on Jun 18

Authors: Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

Abstract

ViGOS is a visually grounded on-policy self-distillation framework for multimodal large language models that improves image-grounded behavior by using specialized teachers for different stages of reasoning and handling invalid rollouts.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
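The teacher routing described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `<description>` tag format, the `route_supervision` function, and the teacher names are all hypothetical, since the abstract does not specify how a rollout is checked for validity or how the description is delimited.

```python
import re
from typing import List, Tuple

# Hypothetical output format: a description wrapped in tags, followed by
# reasoning and the final answer. The source does not specify the real format.
ROLLOUT_PATTERN = re.compile(
    r"^<description>(?P<desc>.+?)</description>(?P<reason>.+)$", re.DOTALL
)


def route_supervision(rollout: str) -> List[Tuple[str, str]]:
    """Return (teacher_name, text_span) pairs for one student rollout."""
    match = ROLLOUT_PATTERN.match(rollout.strip())
    if match is None:
        # Invalid rollout: only the reference teacher is used,
        # to recover the output format.
        return [("reference", rollout)]
    return [
        # Image-only perception teacher supervises the visual description.
        ("perception", match.group("desc")),
        # Privileged reasoning teacher supervises reasoning and final answer,
        # conditioned on the same student-written prefix.
        ("reasoning", match.group("reason")),
    ]
```

Under these assumptions, a well-formed rollout is split into two supervised spans, while a malformed one is handed entirely to the reference teacher.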


Community

OedoSoldier

Paper author Paper submitter about 2 hours ago

🚀 ViGOS: Seeing Before Reasoning for Shortcut-Resilient Multimodal OPSD

MLLMs can reason impressively — but do they really look before they reason? 👀
Vanilla multimodal on-policy self-distillation may let the privileged reference answer leak into dense token supervision, pushing the model toward answer-compatible rationales before visual evidence is grounded.

ViGOS fixes this with a simple but powerful idea: see first, reason second. ✨
The student first writes an explicit visual description, supervised by an image-only perception teacher. Then, only after this visual prefix is in place, a privileged reasoning teacher guides the reasoning and final answer. A reference teacher is used only as a fallback for malformed rollouts — and all teachers are removed at inference time.
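One way to picture the "see first, reason second" supervision is as a per-token distillation loss in which each token position is scored against whichever teacher owns that stage. The sketch below is an assumption-laden toy: the function names, the use of KL divergence from teacher to student, and the averaging are illustrative choices, and the comment above does not state the actual objective.

```python
import math
from typing import Dict, List, Sequence

Dist = Sequence[float]  # a probability distribution over a toy vocabulary


def kl_divergence(p: Dist, q: Dist) -> float:
    """KL(p || q); terms where p is zero contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def staged_distillation_loss(
    student: List[Dist],
    teachers: Dict[str, List[Dist]],
    teacher_for_token: List[str],
) -> float:
    """Average per-token KL, where each token is matched to one teacher.

    teacher_for_token[t] names the teacher (hypothetical labels such as
    "perception" or "reasoning") that supervises position t.
    """
    if not teacher_for_token:
        return 0.0
    total = 0.0
    for t, name in enumerate(teacher_for_token):
        total += kl_divergence(teachers[name][t], student[t])
    return total / len(teacher_for_token)
```

For example, with `teacher_for_token = ["perception", "reasoning"]`, the first position is compared only with the image-only teacher and the second only with the privileged teacher, so the privileged answer never shapes the description tokens. At inference time none of these teacher distributions are needed.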

📈 Results: ViGOS keeps the main OPSD gains on multimodal reasoning benchmarks while improving image-grounded behavior in shortcut-prone settings. On Qwen2.5-VL backbones, ViGOS reaches 71.97 mean Pass@5 on 3B and 75.60 on 7B, and achieves the best ViLP prior-conflict scores across all tested settings — helping models trust the image when priors are wrong. 🔥
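The comment does not say how Pass@5 is computed. A commonly used definition is the unbiased pass@k estimator, which gives the probability that at least one of k samples drawn from n generated samples is correct when c of the n are correct. The sketch below assumes that definition; the function name and arguments are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated per problem, c: correct samples, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0
    # whenever every size-k subset must contain a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A "mean Pass@5" would then be this value averaged over problems and, presumably, over benchmarks; with exactly five samples per problem it reduces to whether any of the five is correct.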

One-line pitch:
🧠➡️👁️ ViGOS teaches MLLMs to ground visual evidence before reasoning — reducing shortcuts without sacrificing strong answer guidance.

🔗 Links

Project Page: https://oedosoldier.github.io/ViGOS/
Paper: https://arxiv.org/abs/2606.19120
Code: https://github.com/OedoSoldier/ViGOS
ViGOS-3B: https://huggingface.co/OedoSoldier/ViGOS-3B
ViGOS-7B: https://huggingface.co/OedoSoldier/ViGOS-7B

Get this paper in your agent:

hf papers read 2606.19120

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
