Hugging Face Daily Papers · 3 min read

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.17628

Published on Jun 16 · Submitted by gbz on Jun 17
Authors: Guibin Zhang, Xun Xu, Yanwei Yue, Zikun Su, Wangchunshu Zhou, Xiaobin Hu, Shuicheng Yan

Abstract

OPD-Evolver is a self-evolving agent framework that combines slow-fast co-evolution with on-policy self-distillation to enhance memory management and policy learning across multiple domains.

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.
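
The fast loop described in the abstract (read, use, write, and maintain experience at test time) can be pictured with a toy sketch. Everything below is an illustrative assumption rather than the authors' implementation: the names `MemoryItem`, `Memory`, `record_outcome`, and `fast_loop` are hypothetical, retrieval is plain word overlap, the four memory levels are reduced to an unnamed integer tag because the abstract does not say what they are, and the slow loop that distills these abilities into the policy is not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    level: int  # 0..3: the abstract mentions four levels but does not name them
    uses: int = 0
    wins: int = 0

    @property
    def value(self) -> float:
        # Laplace-smoothed success rate of the episodes that used this item.
        return (self.wins + 1) / (self.uses + 2)


@dataclass
class Memory:
    capacity: int = 4
    items: list = field(default_factory=list)

    def read(self, query: str, k: int = 2) -> list:
        # Toy relevance: word overlap with the query, ties broken by value.
        words = set(query.lower().split())

        def score(item: MemoryItem):
            return (len(words & set(item.text.lower().split())), item.value)

        ranked = sorted(self.items, key=score, reverse=True)
        return [item for item in ranked[:k] if score(item)[0] > 0]

    def record_outcome(self, used: list, success: bool) -> None:
        # Toy bookkeeping only; not the paper's outcome-calibrated attribution.
        for item in used:
            item.uses += 1
            item.wins += int(success)

    def write(self, text: str, level: int) -> None:
        self.items.append(MemoryItem(text, level))

    def maintain(self) -> None:
        # Keep only the `capacity` highest-value items.
        self.items.sort(key=lambda item: item.value, reverse=True)
        del self.items[self.capacity:]


def fast_loop(memory: Memory, tasks: list, solve) -> None:
    # `solve(task, used)` stands in for the agent and returns (success, lesson).
    for task in tasks:
        used = memory.read(task)              # read
        success, lesson = solve(task, used)   # use
        memory.record_outcome(used, success)
        if lesson:
            memory.write(lesson, level=0)     # write
        memory.maintain()                     # maintain


# Usage with a stand-in solver that succeeds only if a "units" hint was retrieved.
mem = Memory(capacity=2)
mem.write("check units before converting", level=0)
mem.write("sort the list first", level=0)


def toy_solve(task, used):
    return any("units" in item.text for item in used), None


fast_loop(mem, ["convert units of length"], toy_solve)
```

In this sketch the memory operations are hand-written rules. The abstract's claim is that OPD-Evolver instead trains the policy itself to perform selection, use, writing, and maintenance, which this toy example does not attempt to reproduce.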

Community

Paper submitter about 23 hours ago

We introduce OPD-Evolver, a slow-fast co-evolution framework that helps agents not only store experience, but learn how to select, use, write, and maintain it. Across multi-domain benchmarks, OPD-Evolver outperforms existing memory systems, skill-enhanced agents and 300+B counterparts, showing strong potential for building truly self-evolving agents.

GitHub: https://github.com/bingreeky/opd-evolver

Huggingface: https://huggingface.co/greeky/OPDEvolver/


