Hugging Face Daily Papers · 4 min read

Cross-scale Aligned Supervision for Training GANs

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Sangeek Hyun, MinKyu Lee, Jae-Pil Heo
arxiv:2605.26449

Published on May 26, 2026 · Submitted by Sangeek Hyun on May 26

AI-generated summary

Standard GANs that apply adversarial supervision to intermediate outputs can fail to keep those outputs on a consistent sample trajectory across scales. CAT, a transformer-based approach, addresses this by enforcing consistency between intermediate and final outputs.

Abstract

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.
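
The abstract says only that the generator-side regularization "aligns intermediate outputs with the final output"; it does not give the loss. The following is a minimal sketch of one way such a penalty could look, not the paper's formulation. It assumes that alignment is measured by average-pooling the final image down to each intermediate resolution and taking a mean squared error. The names `avg_pool_to`, `mse`, and `consistency_loss` are hypothetical, and images are plain single-channel nested lists to keep the example dependency-free.

```python
from typing import List

# Hypothetical sketch: the pooling operator and the MSE distance are assumptions,
# not details taken from the paper.
Image = List[List[float]]  # square single-channel image as rows of pixel values


def avg_pool_to(image: Image, size: int) -> Image:
    """Downsample a square image to size x size by averaging equal blocks."""
    full = len(image)
    if full % size != 0:
        raise ValueError("target size must divide the image size")
    k = full // size  # edge length of each pooled block
    pooled = []
    for r in range(size):
        row = []
        for c in range(size):
            block = [image[r * k + i][c * k + j] for i in range(k) for j in range(k)]
            row.append(sum(block) / (k * k))
        pooled.append(row)
    return pooled


def mse(a: Image, b: Image) -> float:
    """Mean squared error between two equally sized images."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def consistency_loss(intermediates: List[Image], final: Image) -> float:
    """Average MSE between each intermediate output and the final output
    downsampled to that intermediate's resolution."""
    terms = [mse(img, avg_pool_to(final, len(img))) for img in intermediates]
    return sum(terms) / len(terms)


# Toy 4x4 "final" output and two 2x2 intermediate candidates.
final = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [1.0, 1.0, 0.0, 0.0],
         [1.0, 1.0, 0.0, 0.0]]
aligned = [[0.0, 1.0], [1.0, 0.0]]     # coarse view of the same sample
misaligned = [[1.0, 0.0], [0.0, 1.0]]  # same style of pattern, different sample

loss_aligned = consistency_loss([aligned], final)
loss_misaligned = consistency_loss([misaligned], final)
```

In this toy case both 2x2 candidates are checkerboards of the same kind, so a scale-wise critic that judges each resolution separately would have no reason to prefer one over the other. Only the cross-scale term separates the candidate that matches the final sample from the one that does not, which is the gap the abstract calls cross-scale trajectory misalignment. How such a term is weighted against the adversarial loss is not stated in the source.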

Community

Paper submitter 23 minutes ago

CAT enables scalable one-step image generation by adding cross-scale consistency to a transformer-based GAN, achieving 1.56 FID in only 60 epochs.
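
For readers less familiar with the metric quoted in the comment above, the Fréchet Inception Distance is usually defined as shown below. This is the standard definition, not something specific to this paper. Here $\mu$ and $\Sigma$ denote the mean and covariance of Inception features for real ($r$) and generated ($g$) images, and the "50K" in FID-50K conventionally refers to the number of generated samples used. Lower values are better.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```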


Get this paper in your agent:

hf papers read 2605.26449
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
