Hugging Face Daily Papers · 4 min read

V-Zero: Answer-Label-Free On-Policy Distillation with Contrastive Evidence Gating for Fine-Grained Visual Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.25319

Published on Jun 24 · Submitted by haoxiang sun on Jun 25
Authors: Haoxiang Sun, Zhihang Yi, Langxuan Deng, Yuhao Zhou, Peiqi Jia, Jian Zhao, Li Yuan, Jiancheng Lv, Tao Wang

Abstract

A novel label-free framework for visual reasoning called V-Zero is presented, which uses contrastive evidence gating to improve fine-grained visual reasoning without requiring annotated answer labels, achieving faster training than traditional methods.

Fine-grained visual reasoning requires multimodal large language models (MLLMs) to identify task-relevant visual evidence and ground their reasoning in local image regions. Existing agentic methods typically rely on reinforcement learning with verifiable rewards or supervised fine-tuning on large-scale annotated reasoning traces, leading to costly exploration, hand-designed verification rules, or heavy dependence on textual supervision. A natural way to avoid such external answer labels is to learn from trajectories sampled by the student itself, which points to On-Policy Distillation (OPD). To understand what OPD can and cannot provide for visual reasoning, we revisit it as negative-free stop-gradient alignment. This perspective shows that, although OPD provides effective token-level correction, its ceiling is constrained by the absence of trajectory-level discrimination. Motivated by these observations, we propose V-Zero, an answer-label-free framework for visual reasoning with contrastive evidence gating. V-Zero uses no annotated textual answer labels; instead, during training it pairs a question-relevant regional crop with a negative visual view to evaluate student-sampled trajectories and gate dense token-level distillation. Experiments on multiple visual reasoning benchmarks show that V-Zero consistently improves fine-grained visual reasoning while preserving strong generalization. Notably, V-Zero is more than 5× faster than previous supervised fine-tuning methods and more than 10× faster than reinforcement learning baselines. Code and dataset will be released at https://github.com/eVI-group-SCU/V-Zero.
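
For orientation, here is one common way to write an on-policy distillation objective. It is a sketch and not necessarily the formulation used in the paper. The symbols are assumptions introduced for illustration: \pi_\theta is the student, \pi_T is the teacher, x is the image-question input, y is a trajectory sampled from the student, and sg[·] denotes stop-gradient.

```latex
\mathcal{L}_{\mathrm{OPD}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t=1}^{|y|}
      D_{\mathrm{KL}}\!\left(
        \pi_\theta(\cdot \mid x, y_{<t})
        \,\middle\|\,
        \mathrm{sg}\!\left[\pi_T(\cdot \mid x, y_{<t})\right]
      \right)
    \right]
```

Read this way, every sampled trajectory is pulled toward the teacher token by token, and no term compares one trajectory against another. That is consistent with the abstract's description of OPD as negative-free alignment that lacks trajectory-level discrimination.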

Community

Paper submitter about 4 hours ago

V-Zero improves fine-grained visual reasoning without annotated answer labels. The student model samples on-policy reasoning trajectories from the full image, while a teacher model replays the same trajectories with paired positive and negative visual evidence views. By contrasting teacher support under the task-relevant crop and an irrelevant crop, V-Zero estimates how well each trajectory is grounded in visual evidence and uses this signal to gate dense token-level distillation. The resulting training objective keeps standard full-image inference unchanged while providing answer-label-free supervision for localized visual reasoning.
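
The gating idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of mean per-token log-probability as "teacher support", the sigmoid gate, and the single-sample log-ratio loss are all hypothetical choices made here for clarity.

```python
import math


def mean_logprob(token_logprobs):
    """Average per-token log-probability of one trajectory (assumed support score)."""
    return sum(token_logprobs) / len(token_logprobs)


def evidence_gate(teacher_lp_pos, teacher_lp_neg, temperature=1.0):
    """Hypothetical gate in (0, 1).

    Compares how strongly the teacher supports the student's trajectory when it
    sees the task-relevant crop (pos) versus an irrelevant crop (neg).
    The sigmoid form is an assumption, not taken from the paper.
    """
    margin = mean_logprob(teacher_lp_pos) - mean_logprob(teacher_lp_neg)
    return 1.0 / (1.0 + math.exp(-margin / temperature))


def gated_distillation_loss(student_lp, teacher_lp_pos, teacher_lp_neg,
                            temperature=1.0):
    """Gate-weighted token-level distillation loss for one sampled trajectory.

    The per-token term log p_student - log p_teacher, evaluated on tokens the
    student sampled, is a single-sample estimate of the reverse KL divergence.
    Returns (loss, gate).
    """
    gate = evidence_gate(teacher_lp_pos, teacher_lp_neg, temperature)
    per_token = [s - t for s, t in zip(student_lp, teacher_lp_pos)]
    loss = gate * sum(per_token) / len(per_token)
    return loss, gate


# Toy log-probabilities for a two-token trajectory (made-up numbers).
student_lp = [-0.5, -0.5]
grounded_loss, grounded_gate = gated_distillation_loss(
    student_lp,
    teacher_lp_pos=[-0.1, -0.2],   # teacher agrees when shown the relevant crop
    teacher_lp_neg=[-2.0, -3.0],   # teacher disagrees when shown an irrelevant crop
)
_, neutral_gate = gated_distillation_loss(
    student_lp,
    teacher_lp_pos=[-1.0, -1.0],   # teacher support does not depend on the crop
    teacher_lp_neg=[-1.0, -1.0],
)
```

In this toy setup, a trajectory whose teacher support depends on the relevant crop receives a gate close to 1, while a trajectory the teacher supports equally under both crops receives a gate of 0.5, so its token-level signal is down-weighted. All inputs here are training-time quantities; inference on the full image is unaffected.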


Get this paper in your agent:

hf papers read 2606.25319
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
