Hugging Face Daily Papers · 5 min read

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents


Project page: https://on-policy-data-evolution.github.io/
Code: https://github.com/JoeYing1019/ODE
arxiv:2605.10832

Published on May 11 · Submitted by Shijue Huang on May 13
Authors: Shijue Huang, Hangyu Guo, Chenxin Li, Junting Lu, Xinyu Geng, Zhaochen Su, Zhenyu Li, Shuang Chen, Hongru Wang, Yi R. Fung

Abstract

A visual-native agent harness with image bank reference protocol enables reusable intermediate visual evidence and closed-loop data generation that improves multimodal deep search performance across multiple benchmarks.

AI-generated summary

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro in standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
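The image-bank reference protocol described in the abstract (registering every tool-returned image as an addressable reference so later tool calls can re-consume it) can be sketched minimally as follows. This is an illustrative assumption of how such a protocol might look; the `ImageBank` class, its method names, and the `img://` reference scheme are hypothetical, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class ImageBank:
    """Hypothetical sketch: register tool-returned images under stable
    reference IDs so later tool calls can re-consume them, instead of
    treating them as transient outputs."""
    _store: dict = field(default_factory=dict)
    _counter: int = 0

    def register(self, image_bytes: bytes, source_tool: str) -> str:
        # Every tool-returned image gets an addressable reference.
        self._counter += 1
        ref = f"img://{source_tool}/{self._counter}"
        self._store[ref] = image_bytes
        return ref  # the agent cites this reference in later tool calls

    def resolve(self, ref: str) -> bytes:
        # A later tool (crop, OCR, compare, ...) re-consumes the evidence.
        return self._store[ref]

bank = ImageBank()
ref = bank.register(b"<png bytes>", source_tool="search")
print(ref)  # a stable reference the policy can pass to subsequent tools
```

The key design point, as the abstract frames it, is that intermediate visual evidence becomes persistent state addressable by reference rather than a one-shot output.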

Community

Shijue Huang (@JoeYing), paper author and submitter:

This work introduces On-policy Data Evolution (ODE) for visual-native multimodal deep search agents. It combines an image-bank reference protocol, which allows tool-generated images to be reused across later tool calls, with a closed-loop data evolution pipeline driven by policy rollouts. The results show strong gains across eight multimodal deep search benchmarks, highlighting that effective multimodal agents require both persistent visual state and policy-adaptive data construction.
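The closed-loop, policy-adaptive data construction described above can be illustrated with a toy round function: roll out the current policy, observe which tasks it still fails, and curate the next round's data from those failures. Everything here is a hypothetical simplification for intuition (the `run_ode_round` name, the set-based "policy", and the `learn_rate` knob are invented), not the paper's method.

```python
import random

def run_ode_round(solved, tasks, learn_rate=0.5, rng=None):
    """One toy round of policy-adaptive data curation (illustrative only).
    `solved` stands in for what the current policy can already do."""
    rng = rng or random.Random(0)
    # On-policy rollout feedback: which tasks does the policy still fail?
    failures = [t for t in tasks if t not in solved]
    # Curate data targeting current failure modes; training closes some of them.
    curated = [t for t in failures if rng.random() < learn_rate]
    return solved | set(curated), failures

tasks = list(range(10))
solved = {0, 1}
for rnd in range(3):
    solved, failures = run_ode_round(solved, tasks, rng=random.Random(rnd))
    print(f"round {rnd}: {len(failures)} failures remaining before training")
```

The point of the loop is that each round's curated data tracks what the current policy still needs, in contrast to a fixed curation recipe built once up front.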

KABI (@dongguanting): Really interesting work on self-evolving multimodal agents!


Get this paper in your agent:

hf papers read 2605.10832
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

