Hugging Face Daily Papers · 5 min read

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.19162

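Published on Jun 17, 2026 · Submitted by Nicolas Beltran-Velez on Jun 18, 2026
Authors: Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal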

Abstract

Discriminator-Guided Reinforcement Learning (DRL) addresses alignment issues in score- and flow-matching models by using the logit of a discriminator trained in a pretrained representation space as the reward signal, improving both visual fidelity and semantic quality without human preferences.

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations.

We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., 9.38 to 2.62 on SiT) and semantic-space FD (e.g., 88.2 to 19.3 on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
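
The claim that the logit is the optimal reward can be unpacked with a standard density-ratio argument. The sketch below is not taken from the paper. It assumes balanced classes, a Bayes-optimal discriminator, and a KL weight $\beta = 1$, and the symbols $\phi$, $q_{\text{data}}$, $q_{\text{base}}$, $d^\star$, and $p^\star$ are notation introduced here for illustration.

```latex
% Assumed notation: \phi is the pretrained feature map; q_data and q_base are the
% feature-space densities of real data and of base-model samples.
\begin{align*}
d^\star(z) &= \log \frac{q_{\text{data}}(z)}{q_{\text{base}}(z)}
  && \text{(logit of the Bayes-optimal discriminator, balanced classes)} \\
p^\star &= \arg\max_{p}\; \mathbb{E}_{x \sim p}\big[r(x)\big]
  - \beta\, \mathrm{KL}\big(p \,\|\, p_{\text{base}}\big)
  \;\Longrightarrow\; p^\star(x) \propto p_{\text{base}}(x)\, \exp\!\big(r(x)/\beta\big) \\
r(x) = d^\star(\phi(x)),\ \beta = 1
  &\;\Longrightarrow\; p^\star(x) = p_{\text{base}}(x)\,
  \frac{q_{\text{data}}(\phi(x))}{q_{\text{base}}(\phi(x))}
\end{align*}
```

Under these assumptions the feature-space marginal of $p^\star$ equals $q_{\text{data}}$, so the reward-tilted model matches the data distribution as seen through $\phi$. With a learned, imperfect discriminator the logit is only an estimate of this ratio.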

Community

TLDR: The paper argues that RL helps flow models because rewards provide a more aligned optimization landscape than flow matching for many aspects of the data, like perceptual features. It turns this into a method by training a discriminator in SSL feature space and using its logit as a reward. This improves FID/feature-space FD, boosts held-out preference rewards without training on them, and helps later preference-based RL. It is validated on SiT, REPA, JiT, and RAE.
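
As a rough illustration of "train a discriminator in feature space and use its logit as a reward", here is a minimal NumPy sketch. It is not the authors' implementation: synthetic Gaussian vectors stand in for SSL features of real images and base-model samples, the discriminator is assumed to be linear, and the names `train_discriminator` and `reward` are hypothetical. The RL update itself is not shown.

```python
import numpy as np


def train_discriminator(feat_data, feat_model, lr=0.5, steps=500):
    """Fit a linear logistic discriminator (data = 1, model = 0) by gradient descent."""
    x = np.concatenate([feat_data, feat_model], axis=0)
    y = np.concatenate([np.ones(len(feat_data)), np.zeros(len(feat_model))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = x @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - y  # per-sample gradient of binary cross-entropy w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def reward(features, w, b):
    """Discriminator logit, used as the per-sample reward."""
    return features @ w + b


# Assumption: toy stand-ins for pretrained features of data and base-model samples.
rng = np.random.default_rng(0)
dim = 8
feat_data = rng.normal(loc=0.5, scale=1.0, size=(2000, dim))
feat_model = rng.normal(loc=0.0, scale=1.0, size=(2000, dim))

w, b = train_discriminator(feat_data, feat_model)
r_data = reward(feat_data, w, b).mean()
r_model = reward(feat_model, w, b).mean()
print(f"mean reward on data features:  {r_data:.3f}")
print(f"mean reward on model features: {r_model:.3f}")
```

In this toy setup the logit is higher on data-like features than on model-like ones, which is the signal an RL step would push the generator toward. In practice the features would come from a frozen pretrained encoder and the reward would be combined with a KL penalty toward the base model.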
