Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
Authors: Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han
Published: 2026-06-17
Paper: https://arxiv.org/abs/2606.19120
Project page: https://oedosoldier.github.io/ViGOS/
Code: https://github.com/OedoSoldier/ViGOS
Abstract
ViGOS is a visually grounded on-policy self-distillation framework for multimodal large language models that improves image-grounded behavior by using specialized teachers for different stages of reasoning and handling invalid rollouts.
On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.
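The abstract's "dense token-level targets" with two teachers on one student rollout can be pictured with a small numerical sketch. Everything specific here is an assumption rather than the paper's definition: the loss is taken to be a per-token KL(teacher || student), it is averaged over positions, and the function and parameter names are hypothetical. The only part taken from the abstract is the split itself: description tokens are scored by the image-only perception teacher, and the remaining tokens by the privileged reasoning teacher.

```python
import math
from typing import Sequence


def kl_divergence(teacher: Sequence[float], student: Sequence[float]) -> float:
    """KL(teacher || student) for two probability vectors over one vocabulary.

    Assumes student probabilities are positive wherever the teacher's are.
    """
    return sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0.0)


def segmented_distillation_loss(
    student: Sequence[Sequence[float]],
    perception_teacher: Sequence[Sequence[float]],
    reasoning_teacher: Sequence[Sequence[float]],
    description_len: int,
) -> float:
    """Mean per-token KL where the teacher depends on the token's segment.

    Positions [0, description_len) belong to the visual description and use
    the perception teacher; all later positions use the reasoning teacher.
    The KL direction and the averaging are assumptions of this sketch.
    """
    if not student:
        raise ValueError("rollout must contain at least one token")
    if not (len(student) == len(perception_teacher) == len(reasoning_teacher)):
        raise ValueError("all distributions must cover the same positions")

    total = 0.0
    for pos, student_dist in enumerate(student):
        if pos < description_len:
            teacher_dist = perception_teacher[pos]
        else:
            teacher_dist = reasoning_teacher[pos]
        total += kl_divergence(teacher_dist, student_dist)
    return total / len(student)
```

In this toy setup, moving `description_len` changes which teacher a token answers to, which is the mechanism that keeps the privileged reference target away from the description tokens.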
Community
🚀 ViGOS: Seeing Before Reasoning for Shortcut-Resilient Multimodal OPSD
MLLMs can reason impressively — but do they really look before they reason? 👀
Vanilla multimodal on-policy self-distillation may let the privileged reference answer leak into dense token supervision, pushing the model toward answer-compatible rationales before visual evidence is grounded.
ViGOS fixes this with a simple but powerful idea: see first, reason second. ✨
The student first writes an explicit visual description, supervised by an image-only perception teacher. Then, only after this visual prefix is in place, a privileged reasoning teacher guides the reasoning and final answer. A reference teacher is used only as a fallback for malformed rollouts — and all teachers are removed at inference time.
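The routing described above (perception teacher on the description, reasoning teacher on what follows, reference teacher only for malformed rollouts) can be sketched as a small dispatch function. This is an illustration under stated assumptions, not the released implementation: the tag names `<description>`, `<reasoning>`, and `<answer>` are hypothetical placeholders for whatever output format the model actually uses, and the spans are character offsets, whereas a real training loop would most likely work on token positions.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical rollout format: the tag names are placeholders, not the paper's.
ROLLOUT_FORMAT = re.compile(
    r"<description>(?P<description>.+?)</description>\s*"
    r"<reasoning>(?P<reasoning>.+?)</reasoning>\s*"
    r"<answer>(?P<answer>.+?)</answer>",
    re.DOTALL,
)


@dataclass(frozen=True)
class SupervisedSpan:
    teacher: str  # "perception", "reasoning", or "reference"
    start: int    # character offset into the rollout, inclusive
    end: int      # character offset into the rollout, exclusive


def route_teachers(rollout: str) -> list[SupervisedSpan]:
    """Decide which teacher supervises which part of a student rollout."""
    match = ROLLOUT_FORMAT.fullmatch(rollout)
    if match is None:
        # Malformed rollout: the reference teacher covers the whole output.
        return [SupervisedSpan("reference", 0, len(rollout))]
    return [
        # The image-only teacher scores the visual description.
        SupervisedSpan("perception", *match.span("description")),
        # The privileged teacher scores the reasoning through the final answer.
        SupervisedSpan("reasoning", match.start("reasoning"), match.end("answer")),
    ]
```

None of this runs at inference time: the student simply generates its description, reasoning, and answer with no teacher in the loop.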
📈 Results: ViGOS keeps the main OPSD gains on multimodal reasoning benchmarks while improving image-grounded behavior in shortcut-prone settings. On Qwen2.5-VL backbones, ViGOS reaches 71.97 mean Pass@5 on 3B and 75.60 on 7B, and achieves the best ViLP prior-conflict scores across all tested settings — helping models trust the image when priors are wrong. 🔥
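For readers unfamiliar with the Pass@5 metric quoted above, here is a minimal sketch of the widely used unbiased pass@k estimator. Whether ViGOS is evaluated with exactly this estimator, and with how many samples per problem, is not stated in this post, so treat the function as a general illustration of the metric rather than the authors' evaluation code. The name `pass_at_k` and its parameters are this sketch's own.

```python
import math


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from num_samples generations, is among the num_correct correct ones."""
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must lie in [0, num_samples]")
    if not 1 <= k <= num_samples:
        raise ValueError("k must lie in [1, num_samples]")
    # math.comb returns 0 when fewer than k incorrect samples exist,
    # in which case every draw of k samples contains a correct one.
    all_wrong = math.comb(num_samples - num_correct, k)
    return 1.0 - all_wrong / math.comb(num_samples, k)
```

With exactly five samples per problem and k = 5, the estimator reduces to checking whether any of the five generations is correct.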
One-line pitch:
🧠➡️👁️ ViGOS teaches MLLMs to ground visual evidence before reasoning — reducing shortcuts without sacrificing strong answer guidance.
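🔗 Links
Project Page: https://oedosoldier.github.io/ViGOS/
Paper: https://arxiv.org/abs/2606.19120
Code: https://github.com/OedoSoldier/ViGOS
ViGOS-3B: https://huggingface.co/OedoSoldier/ViGOS-3B
ViGOS-7B: https://huggingface.co/OedoSoldier/ViGOS-7B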