Project page: https://quanhaol.github.io/DiffusionOPD-site/
arXiv: https://arxiv.org/abs/2605.15055
DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models
Authors: Quanhao Li, Junqiu Yu, Kaixun Jiang, Yujie Wei, Zhen Xing, Pandeng Li, Ruihang Chu, Shiwei Zhang, Yu Liu, Zuxuan Wu
Abstract
AI-generated summary
DiffusionOPD enables efficient multi-task training for diffusion models through online policy distillation, outperforming existing reinforcement learning approaches in both training efficiency and final performance.
Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on On-Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality than conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
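To make the mean-matching objective concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of the per-step KL loss along student rollouts. It assumes Gaussian per-step transition kernels with a shared variance sigma_t^2, under which the KL has the closed form ||mu_s - mu_t||^2 / (2 sigma_t^2); all names (`student`, `teacher`, `distill_rollout`) are illustrative placeholders.

```python
# Minimal sketch of on-policy distillation with an analytic per-step KL,
# assuming Gaussian transition kernels N(mu, sigma_t^2 I) for both policies.
# Function and argument names are illustrative, not the paper's API.
import torch

def per_step_kl(mu_student, mu_teacher, sigma_t):
    # KL( N(mu_s, s^2 I) || N(mu_t, s^2 I) ) = ||mu_s - mu_t||^2 / (2 s^2),
    # i.e. the per-step KL objective reduces to mean-matching.
    return (mu_student - mu_teacher).flatten(1).pow(2).sum(dim=1) / (2 * sigma_t ** 2)

def distill_rollout(student, teacher, x_t, timesteps, sigmas, cond):
    """Roll out with the student's own policy (on-policy) and accumulate the
    closed-form per-step KL against a frozen, task-specific teacher."""
    loss = x_t.new_zeros(x_t.shape[0])
    for t, sigma_t in zip(timesteps, sigmas):
        mu_s = student(x_t, t, cond)              # student's predicted next-step mean
        with torch.no_grad():
            mu_t = teacher(x_t, t, cond)          # frozen teacher's mean at the same state
        loss = loss + per_step_kl(mu_s, mu_t, sigma_t)
        with torch.no_grad():                     # advance along the student's trajectory
            x_t = mu_s + sigma_t * torch.randn_like(x_t)  # stochastic SDE-style step
    return loss.mean()
```

Because the gradient flows through the analytic KL rather than through sampled log-probabilities, no PPO-style score-function estimator is needed, which is the source of the variance reduction the abstract claims. A deterministic ODE variant would drop the noise term in the rollout step and swap the 1/(2 sigma_t^2) weight for a fixed schedule.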