Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions
Authors: Xin Jin, Huanqia Cai, Zhen Li, Zechao Zhan, Dengyang Jiang, Aiming Hao, Yuming Jiang, Chunle Guo, Peng Gao, Ming-Ming Cheng, Steven C. H. Hoi
Organization: Tongyi-MAI
Published: 2026-06-08 (arXiv: 2606.09076)
Project page: https://github.com/Tongyi-MAI/Z-Image
Abstract
A teacher-student framework decouples complex reasoning from efficient reward deployment in text-to-image training, achieving superior preference accuracy and optimization performance.
Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
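To make the contrast between a scalar reward and a score distribution concrete, here is a minimal sketch in plain Python. It assumes a hypothetical 5-point rubric and made-up logits; the abstract does not state the rubric size or how Z-Reward reads out its distribution, so `SCORES`, `confident`, and `split` are illustrative only. The two example distributions share the same expected score, yet differ sharply in spread, which is the information a single scalar would discard.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, scores):
    """Mean of the score distribution: the scalar a reward would collapse to."""
    return sum(p * s for p, s in zip(probs, scores))


def score_variance(probs, scores):
    """Spread of the score distribution: a simple proxy for rater disagreement."""
    mean = expected_score(probs, scores)
    return sum(p * (s - mean) ** 2 for p, s in zip(probs, scores))


# Hypothetical 5-point rubric (an assumption, not taken from the paper).
SCORES = [1, 2, 3, 4, 5]

# Made-up logits: one judgment concentrated on "3", one split between "1" and "5".
confident = softmax([-4.0, -4.0, 4.0, -4.0, -4.0])
split = softmax([2.0, -4.0, -4.0, -4.0, 2.0])

mean_confident = expected_score(confident, SCORES)
mean_split = expected_score(split, SCORES)
var_confident = score_variance(confident, SCORES)
var_split = score_variance(split, SCORES)

print(f"confident: mean={mean_confident:.3f} variance={var_confident:.3f}")
print(f"split:     mean={mean_split:.3f} variance={var_split:.3f}")
```

Both cases report a mean near 3, so a scalar reward cannot tell them apart, while the variance separates a confident mid-range judgment from a polarized one.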
Community
Tech report.
Z-Reward is a reasoning-internalized teacher-student reward modeling framework for visual generation, developed by the Z-Image Team for ⚡-Image.
Z-Reward decouples reasoning-heavy judgment from efficient reward deployment:
🧠 The Teacher (27B): A large VLM that uses reasoning to infer rubric-aligned score distributions. Trained with Group-wise Direct Score Optimization (GDSO), it reaches 89.6% human preference accuracy on our internally annotated evaluation set.
⚡ The Student (9B): Trained with Reasoning-Internalized Score Distillation (RISD), it internalizes the teacher’s reasoning-conditioned score distribution into a compact model. It reaches 88.6% accuracy (outperforming the OPD baseline) without needing explicit reasoning chains at inference time, enabling efficient direct scoring and gradient backpropagation.
We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, achieving a 41.3% net human-preference improvement over the SFT baseline.
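The idea of a student matching a teacher's score distribution can be sketched as follows. This is not the RISD objective itself, which the post does not spell out: the sketch assumes a plain KL divergence between a fixed teacher distribution and a student softmax, and it optimizes the student's logits directly rather than model weights. The teacher probabilities, learning rate, and step count are made-up values. It relies on the standard fact that the gradient of KL(teacher || student) with respect to the student's logits is `student_probs - teacher_probs`.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_step(student_logits, teacher_probs, lr=0.5):
    """One gradient-descent step on KL(teacher || softmax(student_logits)).

    For a softmax student, d KL / d logit_i = student_prob_i - teacher_prob_i.
    """
    student_probs = softmax(student_logits)
    return [
        z - lr * (q - p)
        for z, q, p in zip(student_logits, student_probs, teacher_probs)
    ]


# Hypothetical teacher distribution over a 5-point rubric (illustrative values).
teacher_probs = [0.05, 0.10, 0.20, 0.45, 0.20]

# The student starts from a uniform distribution (all logits equal).
student_logits = [0.0] * 5
kl_before = kl_divergence(teacher_probs, softmax(student_logits))

for _ in range(200):
    student_logits = distill_step(student_logits, teacher_probs)

student_probs = softmax(student_logits)
kl_after = kl_divergence(teacher_probs, student_probs)

print(f"KL before: {kl_before:.4f}")
print(f"KL after:  {kl_after:.6f}")
```

Because every operation here is differentiable, the same score readout could in principle pass gradients back to whatever produced the logits, which is the property the post highlights for the student.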