Hugging Face Daily Papers · May 22, 2026 · 12 min read

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution




arxiv:2605.21195


Published on May 20 · Submitted by Siyong Jian on May 25 · Westlake University


Authors: Siyong Jian, Siyuan Li, Luyuan Zhang, Zedong Wang, Xin Jin, Ying Li, Cheng Tan, Huan Wang

Abstract

AI-generated summary: Discrete autoregressive text-to-image models suffer from latent covariate shift during policy optimization, which RankE addresses through end-to-end co-evolution of policy and decoder components.

Discrete autoregressive (AR) text-to-image (T2I) models pair a VQ tokenizer with an AR policy, and current post-training pipelines optimize only the policy while keeping the VQ decoder frozen. Recent diffusion T2I work, exemplified by REPA-E, has shown that the VAE itself constitutes a key alignment bottleneck, yet no analogous investigation exists for discrete AR models. We show that policy-only optimization induces Latent Covariate Shift: as the policy evolves, the resulting token distribution diverges from the ground-truth distribution on which the decoder was trained, such that reward scores improve while decoded image quality degrades. To address this mismatch, we propose RankE, the first end-to-end post-training framework for discrete T2I generation. Rather than optimizing the policy against a fixed decoder, RankE co-evolves both components through alternating optimization: each module maximizes a ranking-based alignment objective while being regularized by a stability-preserving anchor suited to its parameter space. This co-evolution breaks the fidelity–alignment trade-off that plagues frozen-decoder approaches: on LlamaGen-XL (775M), standard RL improves CLIP but degrades FID, whereas RankE improves both simultaneously (FID 15.21, CLIP 33.76 on MS-COCO 30K). Consistent gains on Janus-Pro (1B) confirm that decoder co-evolution reliably converts reward optimization into pixel-space quality improvements.


Community

syjian

Paper author Paper submitter 3 days ago

Project Page: https://syjmelody.github.io/RankE/
GitHub: https://github.com/syjmelody/RankE

⚡ TL;DR

RankE is the first end-to-end post-training framework for discrete text-to-image generation that jointly optimizes the Generator and the Decoder. Instead of improving reward scores at the cost of image quality, RankE improves both alignment and fidelity at the same time.

🤔 Background: What is the problem?

Most discrete text-to-image models still follow a two-stage pipeline:

1. train a VQ-VAE / tokenizer to map images into discrete visual tokens;
2. train an autoregressive Generator to model those tokens.

This pipeline works well for pretraining, but post-training is usually incomplete: existing methods optimize the Generator only and keep the Decoder frozen.

That creates a mismatch. As the Generator is optimized to chase higher rewards, its output token distribution gradually drifts away from the real token distribution that the Decoder saw during tokenizer training. The result is a frustrating trade-off:

reward scores go up,
but decoded image quality can get worse.

The paper identifies this issue as Latent Covariate Shift.
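One rough way to make the drift visible is to compare token-usage histograms from real images against those from a post-trained Generator. The sketch below is only an illustration under stated assumptions: the unigram histogram, the KL divergence as the drift measure, and the toy token sequences are all hypothetical choices, not the diagnostic used in the paper.

```python
import math
from collections import Counter


def token_histogram(sequences, vocab_size, eps=1e-8):
    """Empirical distribution over codebook indices, smoothed by eps."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())
    return [(counts.get(i, 0) + eps) / (total + eps * vocab_size)
            for i in range(vocab_size)]


def kl_divergence(p, q):
    """KL(p || q) for two distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical toy data with a 4-entry codebook.
reference = [[0, 1, 2, 3], [1, 2, 3, 0]]  # tokens obtained by encoding real images
generated = [[0, 0, 0, 1], [0, 0, 1, 0]]  # tokens sampled after reward optimization

p_ref = token_histogram(reference, vocab_size=4)
p_gen = token_histogram(generated, vocab_size=4)

# A larger value means the Generator's tokens look less like what the Decoder saw.
drift = kl_divergence(p_gen, p_ref)
print(f"KL(generated || reference) = {drift:.3f}")
```

A unigram histogram ignores spatial structure, so it can only hint at the shift; it is meant as a quick sanity check, not a substitute for image-level metrics such as FID.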

🔍 Why existing solutions are not enough

Recent work such as REPA-E has shown that in continuous diffusion models, the autoencoder is not just a supporting module — it can be a real bottleneck for alignment and visual quality.

But discrete T2I is harder.
Because token sampling and vector quantization are discrete operations, gradients cannot flow cleanly through the entire generation process. That is why most existing RL or preference-optimization methods for discrete generation still update only the Generator while leaving the Decoder unchanged.
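The gradient problem can be seen in a one-dimensional toy case. The sketch below assumes a hypothetical three-entry scalar codebook and a squared-error loss; it uses finite differences to show that a loss computed after nearest-neighbour quantization is locally flat, while the same loss on the continuous value has a usable slope.

```python
def quantize(x, codebook):
    """Nearest-neighbour vector quantization in 1-D: snap x to the closest entry."""
    return min(codebook, key=lambda c: abs(c - x))


def finite_difference(f, x, h=1e-4):
    """Central-difference estimate of df/dx at x."""
    return (f(x + h) - f(x - h)) / (2 * h)


codebook = [-1.0, 0.0, 1.0]  # hypothetical codebook
target = 0.8                 # hypothetical regression target


def quantized_loss(x):
    return (quantize(x, codebook) - target) ** 2


def smooth_loss(x):
    return (x - target) ** 2


# Small changes in x do not change the selected code, so the slope is zero.
grad_quantized = finite_difference(quantized_loss, 0.3)
# Without quantization the loss responds to x as usual.
grad_smooth = finite_difference(smooth_loss, 0.3)

print(grad_quantized, grad_smooth)
```

Away from the decision boundaries between codes the quantized loss carries no gradient signal at all, which is one reason end-to-end backpropagation through token selection is not straightforward.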

So the field already knows the decoder matters — but a practical end-to-end solution for discrete generation has been missing.

🚀 What RankE does

RankE addresses this directly by making the Generator and Decoder co-evolve.

Its core insight is simple:
if Generator optimization already behaves like a ranking process over latent token sequences, why not extend the same ranking principle to pixel-space Decoder optimization?
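To give a feel for what a ranking principle means here, the following sketch implements a generic pairwise logistic ranking loss. It is an assumption chosen for illustration: the source does not spell out RankE's objective, and the function name, scores, and rewards below are all hypothetical.

```python
import math


def pairwise_ranking_loss(scores, rewards):
    """Mean logistic loss over all pairs (i, j) with rewards[i] > rewards[j].

    The loss is small when the model's scores order the candidates
    the same way the rewards do, and large when the order is reversed.
    """
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] > rewards[j]:
                margin = scores[i] - scores[j]
                losses.append(math.log1p(math.exp(-margin)))
    return sum(losses) / len(losses) if losses else 0.0


rewards = [0.9, 0.5, 0.1]  # hypothetical reward for three candidates

agree = pairwise_ranking_loss([2.0, 1.0, 0.0], rewards)     # same order as rewards
disagree = pairwise_ranking_loss([0.0, 1.0, 2.0], rewards)  # reversed order

print(f"agree={agree:.3f} disagree={disagree:.3f}")
```

The same function applies whether the candidates are latent token sequences scored by a Generator or decoded images scored in pixel space, which is the sense in which one ranking idea can cover both modules.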

RankE therefore uses alternating optimization:

Generator step: optimize the policy so that higher-reward latent token sequences receive stronger updates;
Decoder step: optimize the Decoder so it can better adapt to the Generator’s evolving token distribution, while also favoring higher-reward decoded images.
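The alternating schedule can be sketched with a deliberately tiny stand-in: two scalar parameters play the roles of Generator and Decoder, a toy quadratic plays the role of the reward, and a squared distance to the initial values plays the role of a stability anchor. Every function, constant, and loss below is a hypothetical illustration, not the paper's actual objective or update rule.

```python
def objective(g, d, g0, d0, lam):
    """Toy reward minus an anchor penalty that keeps both parameters near their start."""
    reward = -(g + d - 2.0) ** 2                    # best when g + d == 2
    anchor = lam * ((g - g0) ** 2 + (d - d0) ** 2)  # stay close to initial values
    return reward - anchor


def grad_g(g, d, g0, lam):
    """Partial derivative of the objective with respect to g."""
    return -2.0 * (g + d - 2.0) - 2.0 * lam * (g - g0)


def grad_d(g, d, d0, lam):
    """Partial derivative of the objective with respect to d."""
    return -2.0 * (g + d - 2.0) - 2.0 * lam * (d - d0)


def alternate(g0=0.0, d0=0.0, lam=0.1, lr=0.1, rounds=50):
    g, d = g0, d0
    history = [objective(g, d, g0, d0, lam)]
    for _ in range(rounds):
        g += lr * grad_g(g, d, g0, lam)  # "Generator" step: d is held fixed
        d += lr * grad_d(g, d, d0, lam)  # "Decoder" step: uses the updated g
        history.append(objective(g, d, g0, d0, lam))
    return g, d, history


g, d, history = alternate()
print(f"g={g:.3f} d={d:.3f} objective {history[0]:.3f} -> {history[-1]:.3f}")
```

The point of the toy is the control flow: each module is updated while the other is frozen, and the second update always sees the first module's newest state, so neither side is left chasing a stale partner.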

In other words, RankE does not just make the model “better at scoring well.” It aligns optimization across both latent space and pixel space.

This is the key difference from standard frozen-decoder RL.

🧠 Why this matters

In standard RL post-training for discrete T2I, the Generator keeps changing, but the Decoder stays fixed. Over time, the Decoder is forced to decode token patterns it was never really trained to handle.

RankE removes this bottleneck by continuously adapting the Decoder during post-training. This turns reward optimization into actual visual improvement, rather than reward hacking in latent space.

📈 Results

The gains are clear.

On LlamaGen-XL (775M) under CLIP-based optimization:

Standard RL: improves CLIP, but hurts FID
RankE: improves both

Specifically, on MS-COCO 30K:

CLIP: 32.45 → 33.76
FID: 17.76 → 15.21
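For scale, the two changes can be turned into relative percentages. The snippet assumes the left-hand numbers are the scores before RankE post-training, which the post implies but does not state outright; higher CLIP is better and lower FID is better.

```python
def relative_change(before, after):
    """Signed percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


clip_before, clip_after = 32.45, 33.76
fid_before, fid_after = 17.76, 15.21

clip_change = relative_change(clip_before, clip_after)  # positive means better alignment
fid_change = relative_change(fid_before, fid_after)     # negative means better fidelity

print(f"CLIP: {clip_change:+.2f}%  FID: {fid_change:+.2f}%")
# prints "CLIP: +4.04%  FID: -14.36%"
```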

That is the main message of the paper:
RankE breaks the common fidelity–alignment trade-off in discrete text-to-image post-training.

The improvements are also consistent across:

different backbones,
different reward functions,
and multiple evaluation settings.

✨ One-line takeaway

RankE is a more natural way to post-train discrete text-to-image models: instead of optimizing only the Generator, it lets the Generator and Decoder improve together.
