Hugging Face Daily Papers · 4 min read

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.26108

Published on May 25 · Submitted by Yushi Huang on May 26
Authors: Yushi Huang, Xiangxin Zhou, Ruoyu Wang, Chi Zhang, Jun Zhang, Tianyu Pang

Abstract

AI-generated summary: RTDMD is a two-stage framework that combines distribution matching distillation with reward-guided reinforcement learning to better align few-step image generators with human preferences.

Recent advances in few-step diffusion distillation have enabled efficient image generation, yet aligning these models with human preferences remains challenging. We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution matching distillation with reward-guided reinforcement learning for few-step flow generators. We show that minimizing the KL divergence to a reward-tilted teacher distribution naturally decomposes into a distribution matching term and a reward maximization term. In the first stage, we introduce Ambient-Consistent Distribution Matching Distillation (AC-DMD), which performs subinterval-wise distribution matching and augments the fake score objective with a consistency regularizer to help the fake score model track the shifting generator distribution under limited updates. In the second stage, we jointly optimize both terms: for the reward maximization term, we derive a hybrid policy gradient that combines a GRPO-style estimator for the stochastic intermediate transitions with direct reward backpropagation through the deterministic final step, and further introduce step-subset GRPO (SubGRPO) to reduce variance. Experiments on SD3, SD3.5, and FLUX.2 demonstrate that RTDMD establishes new state-of-the-art results across preference, aesthetic, and compositional metrics with only 4 inference steps, outperforming previous few-step text-to-image generation methods. Code and models are available at https://github.com/Harahan/RTDMD.
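The decomposition claimed in the abstract can be sketched in a few lines. This is only an illustration under assumed notation, not the paper's own derivation: $p_\theta$ is the few-step generator's distribution, $p_{\mathrm{T}}$ the teacher's, $r$ the reward, and $\beta > 0$ a hypothetical temperature controlling how strongly the teacher is tilted; prompt conditioning is omitted.

```latex
\begin{align*}
p^{*}(x) &= \frac{1}{Z}\, p_{\mathrm{T}}(x)\, \exp\!\big(r(x)/\beta\big),
\qquad Z = \mathbb{E}_{x \sim p_{\mathrm{T}}}\!\big[\exp\!\big(r(x)/\beta\big)\big], \\[4pt]
D_{\mathrm{KL}}\!\big(p_{\theta} \,\|\, p^{*}\big)
&= \mathbb{E}_{x \sim p_{\theta}}\!\big[\log p_{\theta}(x) - \log p_{\mathrm{T}}(x) - r(x)/\beta\big] + \log Z \\
&= \underbrace{D_{\mathrm{KL}}\!\big(p_{\theta} \,\|\, p_{\mathrm{T}}\big)}_{\text{distribution matching}}
\;-\; \frac{1}{\beta}\,\underbrace{\mathbb{E}_{x \sim p_{\theta}}\!\big[r(x)\big]}_{\text{reward maximization}}
\;+\; \log Z .
\end{align*}
```

Because $\log Z$ does not depend on $\theta$, minimizing the left-hand side amounts to minimizing the KL to the teacher while maximizing the expected reward, with $1/\beta$ setting the trade-off between the two.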

Community

Comment from the paper author and submitter:

We propose Reward-Tilted Distribution Matching Distillation (RTDMD), a two-stage framework that unifies distribution-matching distillation with reward-guided RL for few-step flow generators. Minimizing the KL divergence to a reward-tilted teacher distribution decomposes naturally into a distribution-matching term and a reward-maximization term, instantiated as Ambient-Consistent DMD (AC-DMD) for the cold start and a hybrid policy gradient (SubGRPO plus final-step reward back-propagation) for the RL stage. With 4 NFE (number of function evaluations), RTDMD reaches new SOTA on SD3-M, SD3.5-M, and FLUX.2 4B; the distilled FLUX.2 4B even beats the full FLUX.2 9B teacher (50 NFE) on most rewards.
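To make the GRPO-style part of the hybrid gradient more concrete, here is a minimal sketch of group-relative advantages combined with a random subset of stochastic steps. It is an illustration under assumptions, not the authors' implementation: the function names, the uniform step-sampling rule, and the unclipped surrogate are all hypothetical, and the direct reward back-propagation through the deterministic final step is not shown.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def sample_step_subset(num_stochastic_steps, subset_size, rng):
    """Choose which stochastic transitions enter this update.

    Uniform sampling without replacement is an assumption made for this sketch.
    """
    return sorted(rng.sample(range(num_stochastic_steps), subset_size))


def grpo_surrogate(log_probs, advantages, steps):
    """Score-function surrogate averaged over the group and the chosen steps.

    log_probs[i][t] is the log-probability of sample i's transition at step t.
    In real training these would be differentiable tensors; plain floats are
    used here only to show the arithmetic. No ratio clipping is applied.
    """
    terms = [-adv * row[t] for adv, row in zip(advantages, log_probs) for t in steps]
    return sum(terms) / len(terms)


rng = random.Random(0)
rewards = [0.2, 0.9, 0.5, 0.4]  # made-up rewards for a group of 4 samples
advantages = group_relative_advantages(rewards)

# A 4-step generator with a deterministic final step has 3 stochastic transitions.
steps = sample_step_subset(num_stochastic_steps=3, subset_size=2, rng=rng)

# Placeholder log-probabilities standing in for the policy's transition densities.
log_probs = [[rng.gauss(-1.0, 0.3) for _ in range(3)] for _ in rewards]
loss = grpo_surrogate(log_probs, advantages, steps)
print(steps, [round(a, 3) for a in advantages], round(loss, 4))
```

Restricting the surrogate to a subset of steps means each update touches fewer transitions; how the paper chooses that subset and how this reduces variance is described in the paper, not here.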
