Project page: https://quanhaol.github.io/DiffusionOPD-site/
arXiv: https://arxiv.org/abs/2605.15055
DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models
Authors: Quanhao Li, Junqiu Yu, Kaixun Jiang, Yujie Wei, Zhen Xing, Pandeng Li, Ruihang Chu, Shiwei Zhang, Yu Liu, Zuxuan Wu
Abstract
AI-generated summary
DiffusionOPD enables efficient multi-task training for diffusion models through online policy distillation, outperforming existing reinforcement learning approaches in both training efficiency and final performance.
Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on On-Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality than conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
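To make the mean-matching objective concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of the per-step KL loss along student rollouts. It assumes Gaussian per-step transition kernels with a shared variance sigma_t^2, under which the KL has the closed form ||mu_s - mu_t||^2 / (2 sigma_t^2); all names (`student`, `teacher`, `distill_rollout`) are illustrative placeholders.

```python
# Minimal sketch of on-policy distillation with an analytic per-step KL,
# assuming Gaussian transition kernels N(mu, sigma_t^2 I) for both policies.
# Function and argument names are illustrative, not the paper's API.
import torch

def per_step_kl(mu_student, mu_teacher, sigma_t):
    # KL( N(mu_s, s^2 I) || N(mu_t, s^2 I) ) = ||mu_s - mu_t||^2 / (2 s^2),
    # i.e. the per-step KL objective reduces to mean-matching.
    return (mu_student - mu_teacher).flatten(1).pow(2).sum(dim=1) / (2 * sigma_t ** 2)

def distill_rollout(student, teacher, x_t, timesteps, sigmas, cond):
    """Roll out with the student's own policy (on-policy) and accumulate the
    closed-form per-step KL against a frozen, task-specific teacher."""
    loss = x_t.new_zeros(x_t.shape[0])
    for t, sigma_t in zip(timesteps, sigmas):
        mu_s = student(x_t, t, cond)              # student's predicted next-step mean
        with torch.no_grad():
            mu_t = teacher(x_t, t, cond)          # frozen teacher's mean at the same state
        loss = loss + per_step_kl(mu_s, mu_t, sigma_t)
        with torch.no_grad():                     # advance along the student's trajectory
            x_t = mu_s + sigma_t * torch.randn_like(x_t)  # stochastic SDE-style step
    return loss.mean()
```

Because the gradient flows through the analytic KL rather than through sampled log-probabilities, no PPO-style score-function estimator is needed, which is the source of the variance reduction the abstract claims. A deterministic ODE variant would drop the noise term in the rollout step and swap the 1/(2 sigma_t^2) weight for a fixed schedule.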