Hugging Face Daily Papers · 3 min read

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

TLDR: SEGA is a training-free method that uses spectral guidance to modify attention behavior by scaling RoPE components, improving high-resolution generation in diffusion transformers.

Authors: Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell, Babak Taati
Organization: University of Toronto Computer Science
Project page: https://rajabi2001.github.io/sega/
arxiv:2605.22668

Published on May 21 · Submitted by Javad Rajabi on May 22

Abstract

AI-generated summary: SEGA improves high-resolution text-to-image generation by adaptively scaling attention across RoPE components based on spatial-frequency structure during denoising steps.

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
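To make the abstract's contrast between uniform and per-component scaling concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes 1D RoPE with interleaved channel pairs (DiTs typically use a 2D variant) and a (C, H, W) latent, and the rule in `component_weights` is a hypothetical stand-in, since the abstract does not state how spectral energy is mapped to scales. All function names and the `uniform_scale` and `strength` parameters are illustrative.

```python
import numpy as np


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE rotation frequencies, one per pair of channels."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)


def per_component_logits(q, k, pos_q, pos_k, freqs):
    """Contribution of each RoPE component to the rotated q.k logit."""
    # View each channel pair as a complex number and rotate it by pos * freq.
    qc = (q[0::2] + 1j * q[1::2]) * np.exp(1j * pos_q * freqs)
    kc = (k[0::2] + 1j * k[1::2]) * np.exp(1j * pos_k * freqs)
    return np.real(qc * np.conj(kc))


def radial_band_energy(latent, num_bands):
    """Fraction of spectral energy per radial band of a (C, H, W) latent."""
    _, h, w = latent.shape
    power = (np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2).sum(axis=0)
    radius = np.hypot(np.fft.fftfreq(h)[:, None], np.fft.fftfreq(w)[None, :])
    edges = np.linspace(0.0, radius.max() + 1e-12, num_bands + 1)
    band = np.clip(np.digitize(radius, edges) - 1, 0, num_bands - 1)
    energy = np.bincount(band.ravel(), weights=power.ravel(), minlength=num_bands)
    return energy / max(energy.sum(), 1e-12)


def component_weights(band_energy, uniform_scale, strength):
    """HYPOTHETICAL rule (not from the paper): nudge each component's scale
    by how far its band's energy share is from the average share."""
    # Bands run low -> high frequency, RoPE components run high -> low,
    # so the bands are reversed to pair like with like (an assumption).
    return uniform_scale * (1.0 + strength * (band_energy[::-1] - band_energy.mean()))


rng = np.random.default_rng(0)
head_dim = 8
freqs = rope_frequencies(head_dim)
q, k = rng.standard_normal(head_dim), rng.standard_normal(head_dim)
contrib = per_component_logits(q, k, pos_q=5, pos_k=2, freqs=freqs)

latent = rng.standard_normal((4, 32, 32))  # stand-in for a denoising latent
energy = radial_band_energy(latent, num_bands=freqs.size)

uniform_logit = 1.2 * contrib.sum()  # one scalar for every component
adaptive_logit = (component_weights(energy, 1.2, 0.5) * contrib).sum()
print(uniform_logit, adaptive_logit)
```

With `strength=0` the adaptive logit reduces to the uniform one, which is the content-agnostic baseline the abstract describes. In an actual sampler the band energies would be recomputed from the current latent at each denoising step.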

