Hugging Face Daily Papers · 5 min read

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents


Project page: https://on-policy-data-evolution.github.io/
Code: https://github.com/JoeYing1019/ODE
arxiv:2605.10832

Published on May 11 · Submitted by Shijue Huang on May 13
Authors: Shijue Huang, Hangyu Guo, Chenxin Li, Junting Lu, Xinyu Geng, Zhaochen Su, Zhenyu Li, Shuang Chen, Hongru Wang, Yi R. Fung

Abstract

A visual-native agent harness with image bank reference protocol enables reusable intermediate visual evidence and closed-loop data generation that improves multimodal deep search performance across multiple benchmarks.

AI-generated summary

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro in standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.
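The image-bank reference protocol described in the abstract (registering every tool-returned image as an addressable reference so later tool calls can re-consume it) can be sketched minimally as follows. This is an illustrative assumption of how such a protocol might look; the `ImageBank` class, its method names, and the `img://` reference scheme are hypothetical, not the paper's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class ImageBank:
    """Hypothetical sketch: register tool-returned images under stable
    reference IDs so later tool calls can re-consume them, instead of
    treating them as transient outputs."""
    _store: dict = field(default_factory=dict)
    _counter: int = 0

    def register(self, image_bytes: bytes, source_tool: str) -> str:
        # Every tool-returned image gets an addressable reference.
        self._counter += 1
        ref = f"img://{source_tool}/{self._counter}"
        self._store[ref] = image_bytes
        return ref  # the agent cites this reference in later tool calls

    def resolve(self, ref: str) -> bytes:
        # A later tool (crop, OCR, compare, ...) re-consumes the evidence.
        return self._store[ref]

bank = ImageBank()
ref = bank.register(b"<png bytes>", source_tool="search")
print(ref)  # a stable reference the policy can pass to subsequent tools
```

The key design point, as the abstract frames it, is that intermediate visual evidence becomes persistent state addressable by reference rather than a one-shot output.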

Community

Shijue Huang (@JoeYing), paper author and submitter:

This work introduces On-policy Data Evolution (ODE) for visual-native multimodal deep search agents. It combines an image-bank reference protocol, which allows tool-generated images to be reused across later tool calls, with a closed-loop data evolution pipeline driven by policy rollouts. The results show strong gains across eight multimodal deep search benchmarks, highlighting that effective multimodal agents require both persistent visual state and policy-adaptive data construction.
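The closed-loop, policy-adaptive data construction described above can be illustrated with a toy round function: roll out the current policy, observe which tasks it still fails, and curate the next round's data from those failures. Everything here is a hypothetical simplification for intuition (the `run_ode_round` name, the set-based "policy", and the `learn_rate` knob are invented), not the paper's method.

```python
import random

def run_ode_round(solved, tasks, learn_rate=0.5, rng=None):
    """One toy round of policy-adaptive data curation (illustrative only).
    `solved` stands in for what the current policy can already do."""
    rng = rng or random.Random(0)
    # On-policy rollout feedback: which tasks does the policy still fail?
    failures = [t for t in tasks if t not in solved]
    # Curate data targeting current failure modes; training closes some of them.
    curated = [t for t in failures if rng.random() < learn_rate]
    return solved | set(curated), failures

tasks = list(range(10))
solved = {0, 1}
for rnd in range(3):
    solved, failures = run_ode_round(solved, tasks, rng=random.Random(rnd))
    print(f"round {rnd}: {len(failures)} failures remaining before training")
```

The point of the loop is that each round's curated data tracks what the current policy still needs, in contrast to a fixed curation recipe built once up front.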

KABI (@dongguanting): Really interesting work on self-evolving multimodal agents!


Get this paper in your agent:

hf papers read 2605.10832
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

