Hugging Face Daily Papers · 13 min read

RiT: Vanilla Diffusion Transformers Suffice in Representation Space

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.21981

Published on May 21 · Submitted by le.zhang on May 21
Authors: Le Zhang, Ning Mang, Aishwarya Agrawal

AI-generated summary

Flow matching in representation spaces with improved statistical properties enables efficient diffusion model training with reduced parameters and fast sampling.

Abstract

Flow matching with x-prediction -- regressing the clean data point rather than the ambient velocity -- is known to exploit low-dimensional manifold structure effectively in pixel space [li2025back]. We ask whether a pretrained representation space, while containing a low-dimensional data manifold of comparable intrinsic dimensionality, offers a distribution more favorable for flow-matching learning. Comparing pixel, SD-VAE, and DINOv2 features along four geometric axes, we find that pixel and DINOv2 share nearly identical intrinsic dimensionalities (both d ≈ 33) yet DINOv2 exhibits 7.3× higher effective rank, 35× better covariance conditioning, 11.5× lower excess kurtosis, and 1.7× lower on-manifold interpolation error; SD-VAE latents are consistently intermediate, indicating that the advantage stems from representation-learning objectives rather than mere compression. These statistical properties render the flow-matching regression well-conditioned and remove the need for the specialized prediction heads or Riemannian transport used by prior DINOv2 diffusion methods. We propose the Representation Image Transformer (RiT): a vanilla Diffusion Transformer trained by x-prediction on frozen DINOv2 features, augmented only by a dimension-aware noise schedule and joint [CLS]-patch modeling. On ImageNet 256×256, RiT attains FID 1.45 without guidance and 1.14 with classifier-free guidance, outperforming DiT^DH-XL with 19% fewer parameters (676M vs. 839M). The resulting ODE is efficiently solvable at coarse discretizations: with classifier-free guidance, 5 Heun steps already reach FID 2.0 and 10 steps reach 1.25, without distillation or consistency training. Code at https://github.com/lezhang7/RiT.
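
Three of the statistics named in the abstract can be made concrete with a short sketch. The definitions used here are common choices and are assumptions, not necessarily the ones used in the paper: effective rank as the exponential of the entropy of the normalized covariance spectrum, conditioning as the ratio of the largest to the smallest covariance eigenvalue, and excess kurtosis averaged over feature dimensions. The function name `geometry_stats` and the synthetic inputs are hypothetical.

```python
import numpy as np


def geometry_stats(features: np.ndarray) -> dict:
    """Summary statistics for an (n_samples, n_dims) feature matrix.

    Assumed definitions (not confirmed by the paper):
      effective_rank   = exp(entropy of normalized covariance eigenvalues)
      condition_number = largest / smallest covariance eigenvalue
      excess_kurtosis  = mean over dims of (4th standardized moment - 3)
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(features) - 1)
    # eigvalsh returns eigenvalues in ascending order; clip for numerical safety.
    eig = np.clip(np.linalg.eigvalsh(cov), 1e-12, None)

    p = eig / eig.sum()
    effective_rank = float(np.exp(-(p * np.log(p)).sum()))
    condition_number = float(eig[-1] / eig[0])

    std = centered.std(axis=0) + 1e-12
    fourth_moment = ((centered / std) ** 4).mean(axis=0)
    excess_kurtosis = float(fourth_moment.mean() - 3.0)

    return {
        "effective_rank": effective_rank,
        "condition_number": condition_number,
        "excess_kurtosis": excess_kurtosis,
    }


# Synthetic illustration: isotropic Gaussian features vs. the same features
# with one direction stretched by a factor of 10.
rng = np.random.default_rng(0)
isotropic = rng.standard_normal((20000, 8))
stretched = isotropic * np.array([10.0, 1, 1, 1, 1, 1, 1, 1])

print(geometry_stats(isotropic))
print(geometry_stats(stretched))
```

On the synthetic data, stretching a single direction lowers the effective rank and raises the condition number, which is the kind of contrast the abstract reports between feature spaces.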

Community

Paper submitter about 1 hour ago

RiT-XL: Vanilla Diffusion Transformers Are Enough in Representation Space

This repository hosts the released RiT-XL checkpoint, taken at epoch 740 of an 800-epoch training run
on ImageNet 256×256 with frozen DINOv2-Small features.

GitHub: https://github.com/lezhang7/RiT
Paper: arxiv.org/abs/2605.21981

Results on ImageNet 256×256

Method          Encoder        Params   FID ↓ (CFG=1)   FID ↓ (CFG≈3.7)
DiT-XL          SD-VAE         675M     9.62            2.27
SiT-XL          SD-VAE         675M     8.61            2.06
REPA-XL         SD-VAE         675M     5.78            1.29
DDT-XL          SD-VAE         675M     6.27            1.26
REG-XL          SD-VAE         675M     1.80            1.36
RAE-XL          DINOv2-S       676M     1.87            1.41
RAE-XL^DH       DINOv2-B       839M     1.51            1.16
FAE-XL          FAE-DINOv2-G   675M     1.48            1.29
RiT-XL (ours)   DINOv2-S       676M     1.45            1.14

All FIDs use 25 Heun steps with the time-shift schedule.

Few-step generation (no distillation, no consistency training):

Heun steps       5      10     25     50
FID (CFG=1.0)    2.44   1.59   1.47   1.46
FID (CFG=3.7)    1.99   1.27   1.15   1.15
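
To show what a "Heun step" means for an x-prediction model, here is a minimal scalar sketch. It assumes the convention z_t = t·x + (1 − t)·ε (noise at t = 0, data at t = 1), so the velocity implied by a clean-data prediction is (x_pred − z) / (1 − t). That convention, the function names, and the toy denoiser are assumptions for illustration; the released sampler operates on tensors and also applies guidance and the time-shift schedule, which are not modeled here.

```python
from typing import Callable, List


def heun_sample(
    x_pred: Callable[[float, float], float],
    z0: float,
    times: List[float],
) -> float:
    """Integrate dz/dt = (x_pred(z, t) - z) / (1 - t) along `times` with Heun's method.

    `times` must be increasing and end at 1.0 (the data end under the
    assumed convention). The last step falls back to Euler because the
    velocity is undefined at t = 1.
    """

    def velocity(z: float, t: float) -> float:
        return (x_pred(z, t) - z) / (1.0 - t)

    z = z0
    for t, t_next in zip(times[:-1], times[1:]):
        dt = t_next - t
        v = velocity(z, t)
        z_euler = z + dt * v  # predictor (Euler) step
        if t_next >= 1.0:
            z = z_euler  # final step: Euler only
        else:
            v_next = velocity(z_euler, t_next)
            z = z + 0.5 * dt * (v + v_next)  # corrector: average both slopes
    return z


def uniform_grid(n_steps: int) -> List[float]:
    """Evenly spaced times from 0 to 1 (a stand-in for the time-shift schedule)."""
    return [i / n_steps for i in range(n_steps + 1)]


# Toy denoiser: the "dataset" is a single point, so the ideal clean-data
# prediction is that point regardless of the noisy input.
target = 2.0
sample = heun_sample(lambda z, t: target, z0=-1.3, times=uniform_grid(5))
print(sample)
```

In this sketch, N Heun steps cost 2N − 1 denoiser evaluations, since every step except the last evaluates the model twice.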

Quick start

The full training/inference code lives at
lezhang7/RiT. The eval script auto-pulls
this checkpoint plus the matching RAE decoder on first run:

git clone https://github.com/lezhang7/RiT.git
cd RiT
pip install -r requirements.txt
bash scripts/eval.sh        # CFG=3.7, FID ~1.14 on ImageNet 256x256

To download just the weights manually:

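import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint file from the Hub (cached locally after the first call).
ckpt = hf_hub_download(repo_id="le723z/RiT", filename="checkpoint-last.pth")
# weights_only=False unpickles arbitrary Python objects (the checkpoint stores an
# argparse namespace under 'args'), so only use it with checkpoints you trust.
state = torch.load(ckpt, map_location="cpu", weights_only=False)
# state['model'] / state['model_ema1'] / state['model_ema2'] are the
# trainable + two EMA-decay parameter dictionaries.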

Checkpoint contents

checkpoint-last.pth is a PyTorch checkpoint produced after 740 training
epochs (the released model used for the paper's headline numbers). Top-level
keys:

  • model — main parameters of the Denoiser (RiT-XL backbone).
  • model_ema1 — EMA decay 0.9999 (used for sampling by default).
  • model_ema2 — EMA decay 0.9996 (tracked but unused at inference).
  • optimizer — AdamW state for resuming training.
  • epoch — 740.
  • args — argparse namespace from the original training run (legacy
    JiT-RAE-XL/16 model name; the architecture matches the released
    RiT-XL/16).

Loading uses only model / model_ema*, so the legacy args field does not
matter — eval.sh constructs the model from the CLI flags.
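
As a rough illustration of what the two EMA entries represent, the sketch below shows a standard exponential-moving-average update and the usual rule of thumb that a decay d averages over about 1 / (1 − d) recent steps. The helper names (`ema_update`, `ema_horizon`, `select_weights`) are hypothetical and are not taken from the RiT repository; the fallback order in `select_weights` is likewise an assumption, chosen to match the note above that `model_ema1` is used for sampling by default.

```python
from typing import Any, Dict, Mapping, Sequence, Tuple


def ema_update(
    ema: Dict[str, float], params: Mapping[str, float], decay: float
) -> Dict[str, float]:
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * param."""
    return {name: decay * ema[name] + (1.0 - decay) * params[name] for name in params}


def ema_horizon(decay: float) -> float:
    """Approximate number of recent steps that dominate the average."""
    return 1.0 / (1.0 - decay)


def select_weights(
    state: Mapping[str, Any],
    preferred: Sequence[str] = ("model_ema1", "model_ema2", "model"),
) -> Tuple[str, Any]:
    """Return the first available weight dictionary in `preferred` order."""
    for key in preferred:
        if key in state:
            return key, state[key]
    raise KeyError(f"none of {list(preferred)} found in checkpoint")


# Scalars stand in for tensors here; the arithmetic is the same per element.
ema = ema_update({"w": 1.0}, {"w": 0.0}, decay=0.9)
print(ema)
print(round(ema_horizon(0.9999)), round(ema_horizon(0.9996)))
```

Under this rule of thumb, decay 0.9999 averages over roughly 10,000 steps and decay 0.9996 over roughly 2,500, so `model_ema1` is the smoother of the two.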

Model details

  • Architecture: vanilla Diffusion Transformer — 28 layers, hidden 1152,
    16 heads, SwiGLU FFN, RMSNorm, QK-norm, 2D VisionRoPE, 32 in-context class
    tokens, joint [CLS]-patch modeling.
  • Encoder (frozen): facebook/dinov2-with-registers-small (d=384).
  • Decoder (frozen): ViT-MAE-style decoder from
    nyu-visionx/RAE-collections,
    variant decoders/dinov2/wReg_small/ViTXL_n08/model.pt.
  • Parameters (denoiser only): 676M.
  • Training: 8×H200, batch 1536 effective, AdamW lr=5e-5, 800 epochs (this
    ckpt: epoch 740), x-prediction loss, dimension-aware time shift
    (s ≈ 4.9), CLS auxiliary loss weight λ=0.2.
  • Sampling defaults: Heun, 25 steps, time-shift schedule, CFG=3.7 in
    interval [0.1, 0.98], coupled-noise initialization for [CLS].
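
Two of the defaults above can be sketched in a few lines. Everything here is an assumption made for illustration. The shift factor uses the square-root rule s = sqrt(m / n) from resolution-dependent timestep shifting, with a base dimension n = 4096 and a 16×16 token grid; with d = 384 this happens to give s ≈ 4.9, but the source does not state how s was derived. The guidance function applies classifier-free guidance only inside the interval and falls back to the conditional prediction outside it, which is one common reading of a guidance interval. All function names are hypothetical.

```python
import math
from typing import Tuple


def dimension_aware_shift(tokens: int, channels: int, base_dim: int = 4096) -> float:
    """Shift factor s = sqrt(total latent dimension / base dimension) (assumed rule)."""
    return math.sqrt(tokens * channels / base_dim)


def shift_noise_level(sigma: float, s: float) -> float:
    """Map a noise level sigma in [0, 1] to s*sigma / (1 + (s - 1)*sigma).

    For s > 1 this moves intermediate levels toward higher noise while
    keeping the endpoints 0 and 1 fixed.
    """
    return s * sigma / (1.0 + (s - 1.0) * sigma)


def guided_prediction(
    cond: float,
    uncond: float,
    t: float,
    scale: float = 3.7,
    interval: Tuple[float, float] = (0.1, 0.98),
) -> float:
    """Classifier-free guidance restricted to a time interval (assumed semantics)."""
    lo, hi = interval
    if lo <= t <= hi:
        return uncond + scale * (cond - uncond)
    return cond  # outside the interval: no guidance


s = dimension_aware_shift(tokens=16 * 16, channels=384)
print(round(s, 3))
print(shift_noise_level(0.5, s))
print(guided_prediction(cond=1.0, uncond=0.0, t=0.5))
print(guided_prediction(cond=1.0, uncond=0.0, t=0.99))
```

With these assumed inputs the shift factor is sqrt(24) ≈ 4.90, and a mid-schedule noise level of 0.5 is mapped above 0.5, so more of the sampling budget is spent at high noise.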

Citation

@article{zhang2025rit,
  title  = {RiT: Vanilla Diffusion Transformers Are Enough in Representation Space},
  author = {Zhang, Le and Mang, Ning and Agrawal, Aishwarya},
  year   = {2025}
}

Acknowledgments

This release reuses the frozen DINOv2 encoder + ViT decoder pairing from
RAE and adopts the modernized DiT
block design + in-context class tokens from JiT.
