Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models
Authors: Bowen Ping, Xiangxin Zhou, Penghui Qi, Minnan Luo, Liefeng Bo, Tianyu Pang
Published: 2026-06-09 · arXiv: 2606.11025
Project page: https://jayce-ping.github.io/Flow-DPPO-Project-Page/
Abstract
Flow-DPPO replaces ratio clipping with divergence proximal constraints in flow matching models, improving training stability and multi-objective optimization through exact KL divergence computation.
Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
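The two mechanisms named in the abstract, the closed-form Gaussian KL and the asymmetric divergence mask, can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that). It assumes that the old and new per-step policies are isotropic Gaussians sharing one fixed standard deviation `sigma`, in which case the KL reduces to ||mu_old - mu_new||^2 / (2 sigma^2). The mask rule is one reading of the abstract's description: a step is blocked only when the KL exceeds a threshold and the update would push the ratio further from 1. All function names and the threshold name `delta` are hypothetical.

```python
import math
import random


def gaussian_kl_shared_std(mu_old, mu_new, sigma):
    """Exact KL( N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I) ).

    Assumption: both policies share the same fixed sigma, so the trace and
    log-determinant terms cancel and only the mean difference remains.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def log_ratio(x, mu_old, mu_new, sigma):
    """log pi_new(x) - log pi_old(x) for the same pair of Gaussians."""
    d_new = sum((xi - m) ** 2 for xi, m in zip(x, mu_new))
    d_old = sum((xi - m) ** 2 for xi, m in zip(x, mu_old))
    return (d_old - d_new) / (2.0 * sigma ** 2)


def divergence_mask(kl, ratio, advantage, delta):
    """Return 0.0 to block the gradient for this step, 1.0 to keep it.

    Hypothetical rule: block only if the divergence threshold is violated
    AND the policy-gradient step would move the ratio further from 1
    (raising a ratio already above 1, or lowering one already below 1).
    """
    moving_away = (advantage > 0 and ratio > 1.0) or (advantage < 0 and ratio < 1.0)
    return 0.0 if (kl > delta and moving_away) else 1.0


if __name__ == "__main__":
    random.seed(0)
    mu_old, mu_new, sigma = [0.0, 0.0], [0.1, -0.1], 1.0
    kl = gaussian_kl_shared_std(mu_old, mu_new, sigma)
    # Each sample from the old policy yields one noisy ratio, while the
    # exact KL is the same for every sample at this step.
    for _ in range(3):
        x = [random.gauss(m, sigma) for m in mu_old]
        lr = log_ratio(x, mu_old, mu_new, sigma)
        mask = divergence_mask(kl, math.exp(lr), advantage=1.0, delta=0.005)
        print(f"exact KL={kl:.4f}  single-sample -log ratio={-lr:+.4f}  mask={mask}")
```

In this toy setting the single-sample quantity `-log_ratio` has mean equal to the exact KL and variance equal to twice the KL, so for small divergences the noise is much larger than the value being estimated. That is consistent with the abstract's argument that a per-sample ratio is a poor proxy for the true policy divergence.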