Hugging Face Daily Papers · 4 min read

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Tianhang Wang, Yitong Chen, Wei Song, Zuxuan Wu, Min Li, Jiaqi Wang
Code: https://github.com/Tianhang-Wang/DecQ
arXiv: 2605.22777

Published on May 21, 2026 · Submitted by Wei Song (SII) on May 22, 2026

Abstract

AI-generated summary: DecQ enhances representation autoencoders by introducing lightweight queries that improve reconstruction quality and generative performance without disrupting pretrained semantic spaces.

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3x faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
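The abstract does not spell out how a condenser module works. One plausible reading, offered here only as an illustration, is that a small set of query vectors cross-attends to the patch features of several intermediate encoder layers, and the resulting tokens are appended to the patch tokens fed to the decoder. The sketch below is a toy version of that idea in plain Python. The function names (`cross_attend`, `condense`), the residual update, the feature sizes, and the choice of three layers are all assumptions, and learned projections are omitted entirely; only the count of 8 queries comes from the abstract.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head cross-attention without learned projections (a simplification):
    each query forms a weighted average of all feature tokens."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in features]
        weights = softmax(scores)
        pooled.append(
            [sum(w * tok[j] for w, tok in zip(weights, features)) for j in range(dim)]
        )
    return pooled


def condense(queries, layer_features):
    """Hypothetical condenser: refine the queries against each intermediate
    layer in turn, adding the pooled features as a residual update."""
    for feats in layer_features:
        pooled = cross_attend(queries, feats)
        queries = [[qi + pi for qi, pi in zip(q, p)] for q, p in zip(queries, pooled)]
    return queries


random.seed(0)
DIM, NUM_PATCHES, NUM_QUERIES = 16, 256, 8  # 8 queries per the abstract; other sizes are illustrative


def rand_tokens(n):
    """Random stand-ins for token embeddings."""
    return [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(n)]


queries = rand_tokens(NUM_QUERIES)  # stand-in for learnable query embeddings
layer_features = [rand_tokens(NUM_PATCHES) for _ in range(3)]  # stand-in for frozen VFM features

detail_tokens = condense(queries, layer_features)
decoder_input = layer_features[-1] + detail_tokens  # patch tokens followed by query tokens
print(len(decoder_input), len(decoder_input[0]))
```

In this toy setup the decoder sees 264 tokens instead of 256. That ratio is a property of the illustrative sizes chosen here and is not meant to reproduce the 3.9% compute overhead reported in the abstract.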

Community

Paper submitter about 2 hours ago

DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3x faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.
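To put the quoted PSNR figures in more concrete terms, the short calculation below converts the gain in decibels into a ratio of mean squared errors, using the standard definition PSNR = 10 · log10(MAX² / MSE). It assumes both PSNR values are measured against the same peak value MAX and treats them as single aggregate numbers, which the text above does not confirm. The function name `mse_ratio_from_psnr` is introduced here for illustration.

```python
import math


def mse_ratio_from_psnr(psnr_before_db, psnr_after_db):
    """Ratio MSE_after / MSE_before implied by two PSNR values (same peak value assumed)."""
    return 10 ** (-(psnr_after_db - psnr_before_db) / 10)


psnr_rae, psnr_decq = 19.13, 22.76  # figures quoted in the text above
gain_db = psnr_decq - psnr_rae
ratio = mse_ratio_from_psnr(psnr_rae, psnr_decq)

print(f"PSNR gain: {gain_db:.2f} dB")
print(f"Implied MSE ratio (DecQ / RAE): {ratio:.3f}")
```

Under these assumptions, a gain of about 3.63 dB corresponds to a mean squared error of roughly 0.43 times the baseline.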

Get this paper in your agent:

hf papers read 2605.22777
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
