Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization
Abstract
AI-generated summary
DRoRAE enhances visual representations by fusing multi-layer features from pretrained vision encoders through adaptive routing and incremental correction, improving reconstruction and generation quality.
Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law (R² = 0.86) between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.
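The fusion mechanism the abstract describes (learned routing over encoder depths, plus a bounded correction applied on top of the last-layer latent so it stays compatible with a frozen decoder) can be illustrated with a short sketch. The code below is a hypothetical toy version, not the paper's released module: the class name, the shared per-layer routing logits, and the `energy_budget` norm cap are assumptions made for illustration only.

```python
# Toy sketch of depth-routed multi-layer fusion (illustrative, not DRoRAE's
# actual implementation). Routing weights, correction head, and the energy
# budget below are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthRoutedFusion(nn.Module):
    """Aggregate per-layer encoder features with learned routing weights,
    then add a norm-bounded correction to the last-layer feature so the
    output latent stays close to what a frozen decoder expects."""

    def __init__(self, num_layers: int, dim: int, energy_budget: float = 0.1):
        super().__init__()
        # One learnable routing logit per encoder layer (shared across tokens).
        self.route_logits = nn.Parameter(torch.zeros(num_layers))
        self.correction = nn.Linear(dim, dim)
        self.energy_budget = energy_budget  # caps the relative correction size

    def forward(self, layer_feats: list) -> torch.Tensor:
        # layer_feats: list of [B, N, D] tensors, one per encoder layer.
        feats = torch.stack(layer_feats, dim=0)          # [L, B, N, D]
        weights = F.softmax(self.route_logits, dim=0)    # [L]
        fused = torch.einsum("l,lbnd->bnd", weights, feats)
        # Incremental correction: a residual whose per-token norm is bounded
        # by a fraction of the last-layer feature norm ("energy constraint").
        last = layer_feats[-1]
        delta = self.correction(fused)
        scale = (self.energy_budget * last.norm(dim=-1, keepdim=True)
                 / (delta.norm(dim=-1, keepdim=True) + 1e-6)).clamp(max=1.0)
        return last + delta * scale

# Usage with ViT-style features (12 layers, 256 tokens, 768 dims, all assumed):
layers = [torch.randn(2, 256, 768) for _ in range(12)]
fuse = DepthRoutedFusion(num_layers=12, dim=768)
latent = fuse(layers)  # [2, 256, 768]; stays near the last-layer distribution
```

Under this reading, the reported log-linear scaling law would take a form like rFID ≈ a − b·log C, where C is some measure of fusion capacity and a, b are fitted constants; the abstract only states that such a fit achieves R² = 0.86, so the exact definition of C is the paper's, not this sketch's.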