Hugging Face Daily Papers · June 18, 2026 · 5 min read

The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Nicolas Beltran-Velez, Felix Friedrich, Zhang Xiaofeng, Reyhane Askari-Hemmat, Xiaochuang Han, Adriana Romero-Soriano, Michal Drozdzal

TLDR: The paper argues that RL helps flow models because rewards provide a better-aligned optimization landscape than flow matching for many aspects of the data, such as perceptual features. It turns this into a method by training a discriminator in a self-supervised (SSL) feature space and using its logit as a reward. This improves FID and feature-space FD, boosts held-out preference rewards without training on them, and helps later preference-based RL. It is validated on SiT, REPA, JiT, and RAE.
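The idea in the TLDR (a discriminator on feature vectors whose logit serves as the reward) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the "features" are synthetic Gaussian clouds standing in for frozen-encoder outputs, the discriminator is a plain linear logistic model, and the names `train_logit_discriminator` and `reward` are hypothetical.

```python
import numpy as np

def train_logit_discriminator(real_feats, fake_feats, steps=500, lr=0.5):
    """Fit a linear logistic discriminator (real=1, fake=0); returns (w, b)."""
    x = np.concatenate([real_feats, fake_feats], axis=0)
    y = np.concatenate([np.ones(len(real_feats)), np.zeros(len(fake_feats))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = x @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - y                  # per-sample gradient of the BCE loss w.r.t. the logit
        w -= lr * (x.T @ grad) / len(y)   # full-batch gradient descent step
        b -= lr * grad.mean()
    return w, b

def reward(feats, w, b):
    """Discriminator logit, used as a scalar reward per sample."""
    return feats @ w + b

rng = np.random.default_rng(0)
# Toy stand-ins (assumption): "data" features and "base-model" features
real = rng.normal(loc=0.5, scale=1.0, size=(2000, 4))
fake = rng.normal(loc=-0.5, scale=1.0, size=(2000, 4))

w, b = train_logit_discriminator(real, fake)
mean_real_reward = reward(real, w, b).mean()
mean_fake_reward = reward(fake, w, b).mean()
print("weights:", w, "bias:", b)
print("mean reward on data:", mean_real_reward, "on base samples:", mean_fake_reward)
```

For these two equal-covariance Gaussians the true log-density ratio is the sum of the four coordinates, so the fitted weights should land near 1 and the bias near 0. In the setting the abstract describes, the features would instead come from a pretrained encoder and the logit would be fed to KL-regularized RL; this toy only shows the logit behaving like a log-likelihood ratio.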

arxiv:2606.19162

Published on Jun 17 · Submitted by Nicolas Beltran-Velez on Jun 18 · AI at Meta

Abstract

Discriminator-Guided Reinforcement Learning (DRL) addresses alignment issues in score- and flow-matching models by using a pretrained representation space discriminator as an optimal reward signal, improving both visual fidelity and semantic quality without human preferences.

(AI-generated summary, produced by Qwen/Qwen2.5-Coder-32B-Instruct)

Score- and flow-matching models often rely on preference-based reinforcement learning for two purposes: aligning with subjective preferences and, surprisingly, recovering properties such as visual realism and coherent object structure that matching-based training is intended to learn from the data itself. We argue that this reflects a structural mismatch. Matching losses measure $\ell_2$ regression error on the velocity or score field under training-time marginals, a proxy poorly aligned with the visual and semantic properties that determine sample quality at inference. Given a reward aligned with these properties, RL sidesteps the mismatch by evaluating the model on its own samples and following the reward landscape directly. The challenge is to obtain such a reward without relying on human preferences, which are expensive and conflate data realism with annotator inclinations.

We propose Discriminator-Guided RL (DRL). DRL trains a discriminator to separate data from base-model samples in a pretrained representation space and uses its logit as the reward in KL-regularized RL. The pretrained space restricts the discriminator to perceptually meaningful directions, and the logit estimates the log-likelihood ratio between data and model, which is the optimal reward for targeting the data distribution. Across SiT, JiT, REPA, and RAE, DRL reduces guidance-free FID (e.g., 9.38 to 2.62 on SiT) and semantic-space FD (e.g., 88.2 to 19.3 on DINOv3 for SiT), with consistent gains across all backbones, and improves human-preference rewards without training on them. It also yields a better Pareto frontier between preference reward and image fidelity under subsequent preference-based post-training, increasing alignment while reducing low-level artifacts such as oversaturation and excessive brightness.
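The statement that the discriminator logit is the optimal reward can be unpacked with a standard textbook argument. The notation below is assumed for illustration and may differ from the paper's: $p_{\mathrm{data}}$ is the data distribution, $p_{\mathrm{base}}$ the base model, $D^*$ the Bayes-optimal discriminator under balanced classes, and $\beta$ the KL weight.

```latex
% Bayes-optimal discriminator between data and base-model samples (balanced classes)
D^*(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_{\mathrm{base}}(x)}
\quad\Longrightarrow\quad
r(x) := \log\frac{D^*(x)}{1 - D^*(x)} = \log\frac{p_{\mathrm{data}}(x)}{p_{\mathrm{base}}(x)}

% KL-regularized RL objective and its closed-form maximizer
\max_{p}\; \mathbb{E}_{x\sim p}\big[r(x)\big] - \beta\,\mathrm{KL}\big(p \,\|\, p_{\mathrm{base}}\big)
\quad\Longrightarrow\quad
p^*(x) \propto p_{\mathrm{base}}(x)\,\exp\!\big(r(x)/\beta\big)

% With r the log-ratio and \beta = 1, the maximizer is the data distribution
p^*(x) \propto p_{\mathrm{base}}(x)\cdot\frac{p_{\mathrm{data}}(x)}{p_{\mathrm{base}}(x)} = p_{\mathrm{data}}(x)
```

This sketch is written in sample space. When the discriminator operates on features from a pretrained encoder, as in the abstract, the same steps apply to the feature distributions, so the tilted model matches the data in that feature space rather than necessarily pixel by pixel.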
