Hugging Face Daily Papers · June 10, 2026 · 6 min read

Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. 
Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
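The abstract's claim that the probability ratio is a noisy, single-sample estimate of the divergence can be made concrete with a short worked derivation. This is a sketch under an assumption not stated on this page: both per-step policies are taken to be Gaussians with the same isotropic variance $\sigma_t^2 I$ (fixed by the sampler), differing only in their means. The symbols $\mu_{\text{old}}$, $\mu_{\text{new}}$, $\Delta$, $\sigma_t$ and $\varepsilon$ are introduced here for illustration only.

```latex
% Assumed setup (illustrative, shared isotropic variance):
%   \pi_{\text{old}}(\cdot \mid x_t) = \mathcal{N}(\mu_{\text{old}}, \sigma_t^2 I), \quad
%   \pi_{\text{new}}(\cdot \mid x_t) = \mathcal{N}(\mu_{\text{new}}, \sigma_t^2 I), \quad
%   \Delta = \mu_{\text{new}} - \mu_{\text{old}}
\begin{align}
\mathrm{KL}\!\left(\pi_{\text{old}} \,\|\, \pi_{\text{new}}\right)
  &= \frac{\lVert \Delta \rVert^2}{2\sigma_t^2}
  && \text{(closed form, no sampling needed)} \\
\log r(x)
  = \log \frac{\pi_{\text{new}}(x)}{\pi_{\text{old}}(x)}
  &= \frac{\langle \varepsilon, \Delta \rangle}{\sigma_t}
     - \frac{\lVert \Delta \rVert^2}{2\sigma_t^2},
  && x = \mu_{\text{old}} + \sigma_t \varepsilon,\ \varepsilon \sim \mathcal{N}(0, I) \\
\mathbb{E}_{\varepsilon}\!\left[\log r(x)\right]
  &= -\,\mathrm{KL}\!\left(\pi_{\text{old}} \,\|\, \pi_{\text{new}}\right),
  \qquad
  \mathrm{Var}_{\varepsilon}\!\left[\log r(x)\right]
  = \frac{\lVert \Delta \rVert^2}{\sigma_t^2}
  = 2\,\mathrm{KL}\!\left(\pi_{\text{old}} \,\|\, \pi_{\text{new}}\right)
\end{align}
```

Under this assumption, the log-ratio of a single sample equals the negative KL divergence plus a zero-mean Gaussian term whose variance is twice the KL itself. A clipping rule based on $r(x)$ therefore reacts to the sampled noise $\varepsilon$ as much as to the actual policy shift, whereas the KL depends only on $\Delta$ and $\sigma_t$.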

arxiv:2606.11025

Published on Jun 9 · Submitted by Tianyu Pang on Jun 10 · Tencent-Hunyuan-Multimodal-RL

Authors: Bowen Ping, Xiangxin Zhou, Penghui Qi, Minnan Luo, Liefeng Bo, Tianyu Pang

Abstract

Flow-DPPO replaces ratio clipping with divergence proximal constraints in flow matching models, improving training stability and multi-objective optimization through exact KL divergence computation.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct
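To illustrate the difference between ratio clipping and a divergence proximal constraint, here is a minimal Python sketch. It is not the authors' implementation. It assumes per-step Gaussian policies with a shared isotropic standard deviation `sigma`, and it interprets "moving away from the trusted region" as the advantage and the log-ratio having the same sign, by analogy with PPO clipping. The function names and the thresholds `eps` and `delta` are hypothetical.

```python
import math
from typing import Sequence


def gaussian_kl(mu_old: Sequence[float], mu_new: Sequence[float], sigma: float) -> float:
    """Exact KL(N(mu_old, sigma^2 I) || N(mu_new, sigma^2 I)) for equal variances."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu_old, mu_new))
    return sq_dist / (2.0 * sigma ** 2)


def log_ratio(x: Sequence[float], mu_old: Sequence[float],
              mu_new: Sequence[float], sigma: float) -> float:
    """log pi_new(x) - log pi_old(x) for one sampled x (normalizers cancel)."""
    d_old = sum((xi - a) ** 2 for xi, a in zip(x, mu_old))
    d_new = sum((xi - b) ** 2 for xi, b in zip(x, mu_new))
    return (d_old - d_new) / (2.0 * sigma ** 2)


def ppo_clip_keeps_gradient(advantage: float, log_r: float, eps: float) -> bool:
    """Standard PPO clipping: the gradient vanishes once the sampled ratio
    leaves [1 - eps, 1 + eps] in the direction favored by the advantage."""
    r = math.exp(log_r)
    if advantage > 0 and r > 1.0 + eps:
        return False
    if advantage < 0 and r < 1.0 - eps:
        return False
    return True


def divergence_mask_keeps_gradient(advantage: float, log_r: float,
                                   kl: float, delta: float) -> bool:
    """Assumed asymmetric mask: block the update only if the exact KL exceeds
    delta AND the update would push the policy further from the old one."""
    moving_away = advantage * log_r > 0
    return not (kl > delta and moving_away)
```

In this sketch the decision to block an update depends on the exact `kl` value rather than on how far a single sampled ratio happens to land from 1, so two samples from the same step are subject to the same threshold test and differ only in the direction check.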

arXiv: https://arxiv.org/abs/2606.11025 · Project page: https://jayce-ping.github.io/Flow-DPPO-Project-Page/

Community

P2333 (paper submitter) · about 12 hours ago · edited about 12 hours ago

smithcohn12 · about 8 hours ago

I've experimented with PPO-based flow model training before, and replacing noisy ratio clipping with exact KL-based divergence constraints seems like a much more stable and efficient approach, especially for multi-objective optimization and longer training runs.
https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO

Get this paper in your agent:

hf papers read 2606.11025

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
