Hugging Face Daily Papers · 5 min read

Self-Evolving Visual Questioner

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.13929

Published on Jun 11, 2026
· Submitted by Joanna Liang on Jun 17, 2026
Authors: Yijun Liang, Hengguang Zhou, Ming Li, Lichen Li, Cho-Jui Hsieh, Tianyi Zhou
Organization: University of Maryland College Park
GitHub: https://github.com/tianyi-lab/SeeQ

Abstract

A vision-language model autonomously improves its question-generation capabilities through self-evolution, enhancing both question quality and answerer performance without external supervision.

Vision-language models (VLMs) are typically trained as passive answerers, while their ability to actively ask diverse, non-trivial, visual-centric, and grounded questions remains underexplored. The performance of existing visual questioners is bottlenecked by the availability of high-quality training data or the cost of curating it. We show that a VLM can continuously improve itself as a visual questioner without any external supervision. We propose a self-evolving framework that uses a VLM itself as both a proposer and a filter to produce harder, more informative, and visual-centric questions, while maintaining exploration diversity to avoid training collapse. These questions are then used to train the VLM in both questioner and answerer modes. To evaluate the questioner, we introduce an agentic protocol that assesses questions along perception, reasoning, and diversity dimensions. Experiments across various backbone VLMs show that our method substantially improves the quality of autonomous question generation and expands its difficulty boundary. Under the same budget, our self-supervision is more effective than training on the static source data. Moreover, the self-evolving questioner remains a competitive or even better answerer.
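
The abstract says the filtered questions train the VLM in both questioner and answerer modes. As a rough illustration only, one way to picture this is that each retained question-answer pair yields two training examples, one per mode. The class names, field names, and instruction text below are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    image_id: str  # hypothetical identifier for a raw image
    question: str
    answer: str


@dataclass(frozen=True)
class TrainExample:
    mode: str      # "questioner" or "answerer"
    image_id: str
    prompt: str
    target: str


# Hypothetical instruction text; the paper's actual prompts are not shown here.
QUESTIONER_PROMPT = "Ask a challenging question that requires looking at the image."


def to_train_examples(pair: QAPair) -> list[TrainExample]:
    """Turn one filtered QA pair into one training example per mode."""
    return [
        # Questioner mode: the model learns to produce the question.
        TrainExample("questioner", pair.image_id, QUESTIONER_PROMPT, pair.question),
        # Answerer mode: the model learns to answer that same question.
        TrainExample("answerer", pair.image_id, pair.question, pair.answer),
    ]


pair = QAPair("img_001", "What color is the mug left of the laptop?", "red")
examples = to_train_examples(pair)
for ex in examples:
    print(ex.mode, "|", ex.prompt, "->", ex.target)
```

In this sketch the same pair supplies the target in questioner mode and the prompt in answerer mode, which is one simple reading of "both modes"; the actual data format may differ.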

Community

Paper submitter about 8 hours ago

🚀 Self-Evolving Visual Questioner (SeeQ)
Can Vision-Language Models autonomously improve their question-generating capabilities without human help? Yes.

  • The Core Framework: A fully self-supervised loop where a VLM iteratively proposes visual questions from raw images, rewrites/filters them via a self-critique mechanism, and fine-tunes on its own refined QA data.
  • Without External Supervision: Eliminates any dependency on human-curated annotations or expensive, proprietary teacher APIs.
  • Agentic Evaluation Protocol: Introduces a novel evaluation suite that benchmarks questions based on actual capability—measuring visual search complexity, spatial grounding, contextual reasoning, and semantic diversity instead of relying on basic n-gram metrics.
  • No Capability Trade-offs: Substantially upgrades visual question generation quality across multiple open-source VLM backbones while completely preserving the models' native visual answering performance.
  • Outcomes: question generation (QG) ↑82%; question answering (QA) preserved on Vstar, CVbench, and RWQA.
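
The propose, rewrite/filter, and fine-tune loop described in the first bullet can be sketched as a single round of control flow. This is a minimal sketch under stated assumptions, not the authors' implementation: `self_evolve_round` and the toy callables are hypothetical stand-ins for VLM calls, the word-count check is a placeholder for the self-critique step, and answers are omitted for brevity.

```python
from typing import Callable

Question = str
Kept = list[tuple[str, Question]]


def self_evolve_round(
    images: list[str],
    propose: Callable[[str], list[Question]],
    critique: Callable[[str, Question], tuple[Question, bool]],
    fine_tune: Callable[[Kept], None],
) -> Kept:
    """One round: propose questions, rewrite/filter them, then fine-tune."""
    kept: Kept = []
    for image in images:
        for question in propose(image):
            # The critic may rewrite the question and decides whether to keep it.
            rewritten, ok = critique(image, question)
            if ok:
                kept.append((image, rewritten))
    # The same model would be updated on its own refined data.
    fine_tune(kept)
    return kept


# Toy stand-ins (assumptions) so the control flow can run without a model.
def toy_propose(image: str) -> list[Question]:
    return [
        f"What is in {image}?",
        f"How many objects are left of the largest object in {image}? ",
    ]


def toy_critique(image: str, question: Question) -> tuple[Question, bool]:
    rewritten = question.strip()
    # Placeholder rule: treat very short questions as trivial and drop them.
    return rewritten, len(rewritten.split()) >= 8


training_log: list[int] = []


def toy_fine_tune(data: Kept) -> None:
    training_log.append(len(data))  # record how many examples were used


kept = self_evolve_round(["img_a", "img_b"], toy_propose, toy_critique, toy_fine_tune)
```

Passing the three stages in as callables keeps the loop independent of any particular backbone, which mirrors the claim that the approach applies across multiple open-source VLMs.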

🌐 Project, Code here: https://joliang17.github.io/SelfEvolvingVQG/

[Figure: overview]


Get this paper in your agent:

hf papers read 2606.13929
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
