Hugging Face Daily Papers · 5 min read

Stitched Value Model for Diffusion Alignment

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Papers
arxiv:2605.19804

Published on May 19, 2026 · Submitted by Hyojun Go on May 21, 2026

Authors: Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat, Li Mi, Zhaochong An, Zixiang Zhao, Dominik Narnhofer, Serge Belongie, Federico Tombari, Konrad Schindler

Project page: https://github.com/prs-eth/StitchVM

Abstract

AI-generated summary: StitchVM efficiently transfers pretrained pixel-space reward models to noisy latent spaces for diffusion model alignment through a lightweight model stitching framework.

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes 3.2× faster while halving peak GPU memory, and DiffusionNFT becomes 2.3× faster.
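The trade-off between Tweedie-style and Monte Carlo value estimates can be illustrated on a one-dimensional toy problem. The sketch below is not from the paper. It assumes a standard normal prior on the clean sample, additive Gaussian noise, and a hypothetical reward r(x0) = x0^2, chosen only because the posterior and the true value E[r(x0) | x_t] are then available in closed form. In a real diffusion model the Monte Carlo samples would come from expensive sampler rollouts rather than a closed-form posterior.

```python
import random


def posterior(x_t, sigma):
    """Mean and variance of x0 | x_t, assuming x0 ~ N(0, 1) and x_t = x0 + sigma * eps."""
    mean = x_t / (1.0 + sigma ** 2)
    var = sigma ** 2 / (1.0 + sigma ** 2)
    return mean, var


def reward(x0):
    """Hypothetical toy reward, defined on the clean sample only."""
    return x0 * x0


def tweedie_value(x_t, sigma):
    """Tweedie-style estimate: evaluate the reward at the posterior mean (one cheap call)."""
    mean, _ = posterior(x_t, sigma)
    return reward(mean)


def monte_carlo_value(x_t, sigma, n_samples, rng):
    """Monte Carlo estimate: average the reward over posterior samples (many calls)."""
    mean, var = posterior(x_t, sigma)
    std = var ** 0.5
    total = 0.0
    for _ in range(n_samples):
        total += reward(rng.gauss(mean, std))
    return total / n_samples


def exact_value(x_t, sigma):
    """Closed-form E[x0^2 | x_t] = mean^2 + var for this toy setup."""
    mean, var = posterior(x_t, sigma)
    return mean * mean + var


if __name__ == "__main__":
    rng = random.Random(0)
    x_t, sigma = 1.0, 1.0
    print("exact      :", exact_value(x_t, sigma))
    print("tweedie    :", tweedie_value(x_t, sigma))
    print("monte carlo:", monte_carlo_value(x_t, sigma, 20000, rng))
```

In this toy setup the Tweedie-style estimate misses the posterior-variance term entirely, while the Monte Carlo estimate approaches the exact value only as the number of samples grows. A learned value model aims to return the conditional expectation directly from the noisy input in a single forward pass.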

Community

Paper submitter about 8 hours ago

Flow-GRPO, DiffusionNFT, DPS, FK Steering — they all hit the same wall: pixel-space reward models require expensive rollouts or one-step denoising. 🧩 Can we score noisy diffusion latents directly, without losing pixel-reward quality?

Meet StitchVM.
➡️ Paper: https://lnkd.in/eb24Ui8y
➡️ Website: https://lnkd.in/euPcBrya

👏 Collaboration between ETH Zürich, Google, University of Copenhagen (Københavns Universitet).

💡 Key idea

We stitch a pretrained diffusion backbone to a pretrained pixel reward model using a small stitching layer, then briefly finetune.
The result is a model that judges noisy latents at pixel-reward quality, with only a small finetuning cost.
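As a rough structural illustration of the stitching idea, the sketch below composes three pieces: a stand-in for the diffusion backbone that reads a noisy latent and a timestep, a small stitching layer that maps backbone features to the width expected by the reward model, and a stand-in for the truncated reward model that outputs a scalar. All class names, dimensions, and random weights are hypothetical placeholders, not the StitchVM implementation. Which parameters are updated during the brief finetuning is not specified here, so no training loop is shown.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng, scale=0.1):
    """Random weights standing in for pretrained parameters (illustration only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class StitchedValueModel:
    """Toy composition: backbone features -> stitching layer -> reward-model tail."""

    def __init__(self, latent_dim, backbone_dim, reward_dim, rng):
        # Stand-in for the pretrained diffusion backbone (takes latent plus timestep).
        self.backbone_w = random_matrix(backbone_dim, latent_dim + 1, rng)
        # Small stitching layer bridging the two feature spaces.
        self.stitch_w = random_matrix(reward_dim, backbone_dim, rng)
        # Stand-in for the remaining layers of the truncated pixel reward model.
        self.reward_w = [rng.gauss(0.0, 0.1) for _ in range(reward_dim)]

    def value(self, noisy_latent, t):
        """Score a noisy latent at timestep t in one forward pass, with no denoising."""
        features = [math.tanh(v) for v in matvec(self.backbone_w, noisy_latent + [t])]
        stitched = matvec(self.stitch_w, features)
        return sum(w * math.tanh(z) for w, z in zip(self.reward_w, stitched))


if __name__ == "__main__":
    model = StitchedValueModel(latent_dim=4, backbone_dim=8, reward_dim=6, rng=random.Random(0))
    print(model.value([0.1, -0.2, 0.3, 0.0], t=0.5))
```

The point of the composition is that a single call to `value` replaces a rollout or one-step denoising followed by a pixel-space reward evaluation, so the cost of building the model once can be amortized over many samples and iterations.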

Get this paper in your agent:

hf papers read 2605.19804
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
