Hugging Face Daily Papers · May 25, 2026 · 5 min read

From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2605.20177


Published on May 19 · Submitted by Juncheng Wu on May 25


Authors: Juncheng Wu, Hardy Chen, Haoqin Tu, Xianfeng Tang, Freda Shi, Hui Liu, Hanqing Lu, Cihang Xie, Yuyin Zhou

AI-generated summary

Staged training approaches that separately optimize visual perception, visual reasoning, and textual reasoning in vision-language models outperform unified training methods, leading to improved performance on visual reasoning tasks.

Abstract

Recent advances in vision-language models (VLMs) emphasize long chain-of-thought reasoning; yet, we find that their performance on visual tasks is primarily limited by a lack of visual perception as opposed to reasoning itself. In this work, we systematically study the interplay between perception and reasoning in VLM post-training by decomposing their capabilities into three separate training stages: visual perception, visual reasoning, and textual reasoning, incorporating specialized training data. We demonstrate that visual perception (a) requires targeted optimization with specialized data; (b) serves as a fundamental scaffold that should be solidified through staged training before refining visual reasoning; and (c) is more effectively learned via RL than caption-based SFT. Our experiments across multiple VLMs demonstrate that staged training consistently improves both visual perception and reasoning performance over merged training. Notably, models trained with our approach achieve 1.5% higher reasoning accuracy with 20.8% shorter reasoning traces, suggesting that superior perception reduces the need for excessive reasoning. Furthermore, we show that this capability-based staging represents a new curriculum dimension orthogonal to traditional difficulty-based curricula, and combining both yields further additive gains. Our staged-training models achieve superior performance among open-weight VLMs, establishing advanced results on several visual math and perception (e.g., +5.2% on WeMath and +3.7% on RealWorldQA) tasks compared with the base counterpart.
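Point (c) of the abstract says perception is learned more effectively via RL than via caption-based SFT, and the community post below describes the stages as RLVR (RL with verifiable rewards). The abstract does not describe the reward itself, so the following is only a minimal sketch of what a verifiable reward for a short-answer perception question could look like. The `Answer:` output convention and both function names are assumptions made for illustration, not the authors' implementation.

```python
import re


def extract_final_answer(response: str):
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption for this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", response)
    return matches[-1].strip() if matches else None


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward: 1.0 on a case-insensitive exact match, else 0.0."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0  # no parseable answer, so no reward
    return 1.0 if predicted.lower() == ground_truth.strip().lower() else 0.0


if __name__ == "__main__":
    rollout = "The sign in the image is octagonal.\nAnswer: Stop"
    print(verifiable_reward(rollout, "stop"))
```

A reward of this kind can be checked automatically against a ground-truth label, which is what distinguishes it from imitating a reference caption token by token.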


Community

Chtholly17 (paper author, paper submitter) · about 3 hours ago

Your VLM didn't fail because it didn't think long enough. It failed because it looked wrong. We found 86.9% of Qwen3-VL-8B's wrong answers trace back to a perception error — not a reasoning one.
Our fix: a capability curriculum — a brand-new curriculum dimension that trains perception before reasoning. 🧵
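A figure like the 86.9% above comes from attributing each wrong answer to an error category and then taking shares. The post does not say how the attribution was done, so the sketch below only shows the final tallying step, assuming each failure has already been labeled. The label strings and the toy data are invented for illustration and are not the paper's data.

```python
from collections import Counter


def error_breakdown(labels):
    """Return each error category's share among the labeled wrong answers."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing to attribute
    return {category: count / total for category, count in counts.items()}


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    toy_labels = ["perception"] * 7 + ["reasoning"] * 1
    shares = error_breakdown(toy_labels)
    print(f"perception share: {shares['perception']:.1%}")
```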

- We decouple post-training along a capability axis into 3 sequential RLVR stages: 🟦 Visual Perception → 🟩 Textual Reasoning → 🟨 Visual Reasoning
- Staged > Merged, consistently across 4 backbones (Qwen2.5-VL, Qwen3-VL, InternVL3, InternVL3.5). On Qwen3-VL-8B: +1.46% accuracy with 20.8% shorter reasoning traces.
- Curriculum learning has always meant easy→hard (difficulty axis). We surface a second, orthogonal axis: which capability each epoch trains. The two stack additively: on Qwen3-VL-8B, combining both lifts the average 58.6 → 63.0, beating either axis alone.
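The difference between merged and staged training, and the way the capability axis can sit on top of a difficulty axis, can be sketched as a data-scheduling function. This is a hedged illustration rather than the released code: the `Sample` type, the capability strings, the `difficulty` score, and sorting easy-to-hard inside each stage are all assumptions. Only the stage order is taken from the post.

```python
from dataclasses import dataclass

# Stage order as listed in the post; the string names are illustrative.
STAGE_ORDER = ["visual_perception", "textual_reasoning", "visual_reasoning"]


@dataclass(frozen=True)
class Sample:
    sample_id: str
    capability: str    # one of STAGE_ORDER
    difficulty: float  # hypothetical score, lower means easier


def merged_schedule(samples):
    """Baseline: one stage that mixes every capability together."""
    return [("merged", list(samples))]


def staged_schedule(samples, sort_by_difficulty=False):
    """Capability axis: one stage per capability, in STAGE_ORDER.

    With sort_by_difficulty=True, each stage is also ordered easy-to-hard,
    which is one plausible way to combine the two curriculum axes.
    """
    stages = []
    for capability in STAGE_ORDER:
        stage = [s for s in samples if s.capability == capability]
        if sort_by_difficulty:
            stage = sorted(stage, key=lambda s: s.difficulty)
        stages.append((capability, stage))
    return stages


if __name__ == "__main__":
    data = [
        Sample("a", "visual_reasoning", 0.9),
        Sample("b", "visual_perception", 0.7),
        Sample("c", "textual_reasoning", 0.4),
        Sample("d", "visual_perception", 0.2),
    ]
    for name, stage in staged_schedule(data, sort_by_difficulty=True):
        print(name, [s.sample_id for s in stage])
```

Each returned stage would then be handed to a separate training run that starts from the previous stage's checkpoint, whereas the merged baseline trains once on the single mixed pool.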

🌐 Project: ucsc-vlaa.github.io/VLM-CapCurriculum
📄 arXiv: arxiv.org/abs/2605.20177
💻 Code: github.com/UCSC-VLAA/VLM-CapCurriculum
🤗 HF Collection: UCSC-VLAA/VLM-CapCurriculum


Get this paper in your agent:

hf papers read 2605.20177

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 4

Datasets citing this paper 3

Spaces citing this paper 0


Collections including this paper 0

