Reinforcing Few-step Generators via Reward-Tilted Distribution Matching
Authors: Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang
Organization: Tencent Hunyuan
Published: 2026-05-25 · arXiv: 2605.26108
Abstract
RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to better align few-step image generators with human preferences.
AI-generated summary
Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
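The decomposition claimed in the abstract can be sketched for one common form of reward tilting. The notation here is an assumption, since the abstract defines none of it: p_T is the teacher distribution, p_θ the few-step generator, r the reward, β > 0 a temperature, and Z the normalizing constant. The paper's exact formulation (for example, how the tilt is applied along the flow trajectory) may differ.

```latex
% Reward-tilted teacher (assumed exponential tilting with temperature \beta)
p^{*}(x) = \frac{1}{Z}\, p_{T}(x)\, \exp\!\big(r(x)/\beta\big),
\qquad
Z = \mathbb{E}_{x \sim p_{T}}\!\big[\exp\!\big(r(x)/\beta\big)\big]

% Expanding the KL divergence from the generator to the tilted teacher
D_{\mathrm{KL}}\!\big(p_{\theta} \,\|\, p^{*}\big)
  = \mathbb{E}_{x \sim p_{\theta}}\!\Big[\log p_{\theta}(x) - \log p_{T}(x)
      - \tfrac{1}{\beta}\, r(x) + \log Z\Big]
  = \underbrace{D_{\mathrm{KL}}\!\big(p_{\theta} \,\|\, p_{T}\big)}_{\text{distribution matching}}
    \;-\; \frac{1}{\beta}\,
    \underbrace{\mathbb{E}_{x \sim p_{\theta}}\!\big[r(x)\big]}_{\text{reward maximization}}
    \;+\; \log Z
```

Because log Z does not depend on θ, minimizing the left-hand side is equivalent to minimizing the KL divergence to the teacher while maximizing the expected reward, with β setting the trade-off between the two.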
Community
We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a
two-stage framework that unifies distribution-matching distillation with
reward-guided RL for few-step flow generators. Minimizing the KL divergence to
a reward-tilted teacher distribution decomposes naturally into a
distribution-matching term and a reward-maximization term. In practice,
Ambient-Consistent DMD (AC-DMD) provides the cold start, and a hybrid policy
gradient (SubGRPO + final-step reward back-propagation) handles the reward term
in the RL stage. With 4 NFE, RTDMD reaches new SOTA on SD3-M / SD3.5-M / FLUX.2 4B; the
distilled FLUX.2 4B even beats the full FLUX.2 9B teacher (50 NFE) on most
rewards.
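As a rough illustration of the "GRPO-style" part of the hybrid policy gradient, the sketch below computes group-relative advantages: rewards for several samples of the same prompt are normalized by that group's mean and standard deviation. This is the generic GRPO normalization, not the authors' implementation. The function name, array layout, and `eps` value are hypothetical, and how SubGRPO chooses its step subset is not specified here, so that part is not modeled.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within each prompt group (generic GRPO-style).

    rewards: array of shape (num_prompts, group_size), one row per prompt,
             one column per sampled image. This layout is an assumption.
    eps:     small constant that avoids division by zero when every sample
             in a group receives the same reward.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)  # per-prompt baseline
    std = rewards.std(axis=1, keepdims=True)    # per-prompt scale
    return (rewards - mean) / (std + eps)


# Two prompts, four samples each (made-up reward values for illustration).
rewards = np.array([
    [0.2, 0.8, 0.5, 0.5],
    [1.0, 1.0, 1.0, 1.0],  # identical rewards -> zero advantage everywhere
])
adv = group_relative_advantages(rewards)
```

Samples scoring above their group's mean get positive advantages and the rest get negative ones, so the policy-gradient update needs no learned value function as a baseline.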