Hugging Face Daily Papers · May 25, 2026 · 6 min read

Rethinking Cross-Layer Information Routing in Diffusion Transformers

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.20708

Published on May 20 · Submitted by Maohua Li on May 25 · #3 Paper of the day · RTP-LLM

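Authors: Chao Xu, Maohua Li, Qirui Li, Yixuan Xu, Yanke Zhou, Yunhe Li, Cuifeng Shen, Hanlin Tang, Kan Liu, Tao Lan, Lin Qu, Shao-Qun Zhang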

AI-generated summary

Diffusion Transformers suffer from inefficient cross-layer information flow that traditional residual connections cannot address, prompting the introduction of a learnable, timestep-adaptive routing mechanism that improves training efficiency and model quality.

Abstract

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs. Moreover, the proposed DAR is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet 256×256, DAR improves SiT-XL/2 by 2.11 FID (7.56 vs. 9.67) and matches the baseline's converged quality with 8.75× fewer training iterations. Stacked on top of REPA, it yields a 2× training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, DAR can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.
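
To make "learnable, timestep-adaptive, and non-incremental aggregation over the history of sublayer outputs" more concrete, here is a minimal sketch of one possible reading. It is not the paper's formulation, which the abstract does not spell out. The sketch assumes each routing point keeps one learnable logit per earlier sublayer output plus a learnable sensitivity to the timestep, and mixes the history with softmax weights. The names `route`, `base_logits`, and `t_gain` are hypothetical, and the vectors are toy values rather than real activations.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(history, base_logits, t_gain, t):
    """Mix all earlier sublayer outputs into one input vector.

    history     -- list of equal-length vectors, one per earlier sublayer output
    base_logits -- hypothetical learnable logit per history entry
    t_gain      -- hypothetical learnable timestep sensitivity per history entry
    t           -- denoising timestep, scaled to [0, 1]
    """
    logits = [b + g * t for b, g in zip(base_logits, t_gain)]
    weights = softmax(logits)
    dim = len(history[0])
    mixed = [sum(w * h[i] for w, h in zip(weights, history)) for i in range(dim)]
    return mixed, weights


# Three toy sublayer outputs (dimension 2) standing in for real activations.
history = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
base_logits = [0.0, 0.0, 0.0]
t_gain = [2.0, 0.0, -2.0]  # shifts weight toward the earliest output as t grows

mixed_low_t, w_low_t = route(history, base_logits, t_gain, t=0.0)
mixed_high_t, w_high_t = route(history, base_logits, t_gain, t=1.0)

# Standard residual addition, for contrast: an unweighted running sum.
residual_sum = [sum(h[i] for h in history) for i in range(2)]

print("weights at t=0:", [round(w, 3) for w in w_low_t])
print("weights at t=1:", [round(w, 3) for w in w_high_t])
print("routed input at t=0:", [round(x, 3) for x in mixed_low_t])
print("residual running sum:", residual_sum)
```

Because the softmax weights in this sketch form a convex combination, the routed vector cannot exceed the largest norm in the history, whereas the running sum grows with every added output. Whether DAR itself normalizes its weights this way is not stated in the abstract.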

Community

Met4physics

Paper author · Paper submitter · about 8 hours ago

Hi everyone! We’re excited to share our work “Rethinking Cross-Layer Information Routing in Diffusion Transformers.”

While DiTs have been extensively improved in tokenization, attention, conditioning, objectives, and autoencoders, their residual stream is still largely inherited from the original Transformer. We revisit this overlooked design axis and analyze cross-layer information flow in DiTs across both depth and denoising timestep.

Our diagnosis reveals three symptoms of standard residual routing: forward magnitude inflation, backward gradient decay, and block-wise redundancy. Based on this, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and non-incremental aggregation over previous sublayer outputs.
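
As a rough illustration of what two of these symptoms can look like numerically, the toy sketch below tracks the norm of a residual stream and the cosine similarity between consecutive stream states as depth grows. It is not the paper's measurement protocol: the sublayer outputs are independent Gaussian noise rather than activations from a trained DiT, and the helper names are made up for this example.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


random.seed(0)
dim, depth = 64, 24

# Toy stand-in for a residual stream: x_{l+1} = x_l + f_l, with each sublayer
# output f_l drawn as independent Gaussian noise (an assumption, not real data).
stream = [random.gauss(0.0, 1.0) for _ in range(dim)]
norms = [norm(stream)]
similarities = []
for _ in range(depth):
    update = [random.gauss(0.0, 1.0) for _ in range(dim)]
    new_stream = [s + u for s, u in zip(stream, update)]
    similarities.append(cosine(stream, new_stream))
    norms.append(norm(new_stream))
    stream = new_stream

print(f"stream norm: depth 0 = {norms[0]:.1f}, depth {depth} = {norms[-1]:.1f}")
print(f"cosine(x_l, x_l+1): first = {similarities[0]:.2f}, last = {similarities[-1]:.2f}")
```

With uncorrelated updates the norm grows roughly like the square root of depth, and later updates change the stream's direction less and less, which is one simple way to picture how deep blocks can end up looking redundant. Backward gradient decay would need a real model with automatic differentiation and is not covered here.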

On ImageNet 256×256, DAR improves SiT-XL/2 from 9.67 to 7.56 FID and reaches the baseline’s converged quality with 8.75× fewer iterations. DAR is also complementary to REPA, bringing a 2× early-stage speedup, and helps preserve high-frequency details during DMD for large-scale T2I models.

We hope this work highlights cross-layer routing as an underexplored but promising direction for Diffusion Transformers. Feedback is very welcome!

edmond

May 25, 2026

Thanks for this interesting paper. Are you planning to release the code?

Met4physics

Paper author · May 25, 2026

> Thanks for this interesting paper. Are you planning to release the code?

Thank you for your interest in our work. Due to company policy, we need to go through an internal review process first. We will release the code as soon as the review is complete.