Hugging Face Daily Papers · May 18, 2026 · 6 min read

Efficient Image Synthesis with Sphere Latent Encoder

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2605.15592

Published on May 15 · Submitted by Do Thanh Tung on May 18

Mohamed Bin Zayed University of Artificial Intelligence

Authors: Tung Do, Thuan Hoang Nguyen, Hao Li

Abstract

AI-generated summary: A decoupled framework for few-step image generation that improves efficiency and performance by separating pixel-space operations from latent denoising training.

Few-step image generation has seen rapid progress, with consistency and meanflow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between the pixel space and latent space during inference while jointly optimizing reconstruction and generation within a single architecture. This design leads to computational inefficiency and objective conflict between reconstruction and generation. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, improving efficiency and allowing reconstruction and generation to specialize independently. On the Animal-Faces, Oxford-Flowers, and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.

Project page: https://sphere-latent-encoder.github.io

Community

itsthanhtung (paper author, paper submitter) · May 18, 2026

🚀 Sphere Latent Encoder: Efficient Image Synthesis with Spherical Latent Denoising

This paper proposes Sphere Latent Encoder, an efficient few-step image generation framework that performs denoising entirely in a spherical latent space. Instead of repeatedly moving between pixel space and latent space as in the original Sphere Encoder, the method uses a fixed pretrained representation autoencoder and trains a separate latent denoising model. This decouples reconstruction from generation and makes sampling much more efficient.

Key idea:
Use a pretrained RAE/DINOv2-based encoder as a strong image tokenizer, project noisy latents onto a hypersphere, and train a transformer denoiser directly in that latent space. During inference, the model refines latents over only a few steps and calls the decoder once at the end.
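
The sampling procedure described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: `project_to_sphere`, `toy_denoiser`, and `sample` are hypothetical names, the denoiser is a stand-in that blends toward a known clean latent rather than a trained transformer, and the noise schedule, conditioning, and guidance used in the paper are not modeled.

```python
import math
import random


def project_to_sphere(v, radius=1.0):
    """Rescale v so that it lies on the hypersphere of the given radius."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto a sphere")
    return [radius * x / norm for x in v]


def toy_denoiser(z, target, strength=0.5):
    """Hypothetical stand-in for the transformer denoiser.

    It blends z with a known clean latent; a trained model would have to
    predict the clean latent instead of being handed it.
    """
    return [(1.0 - strength) * a + strength * b for a, b in zip(z, target)]


def sample(denoise, decode, dim, num_steps, rng):
    """Refine a random spherical latent for num_steps, then decode once."""
    z = project_to_sphere([rng.gauss(0.0, 1.0) for _ in range(dim)])
    for _ in range(num_steps):
        # Re-project after every update so the latent stays on the sphere.
        z = project_to_sphere(denoise(z))
    # The decoder is called a single time, after the last refinement step.
    return decode(z), z


if __name__ == "__main__":
    rng = random.Random(0)
    target = project_to_sphere([1.0, 2.0, 3.0, 4.0])
    image, z = sample(lambda v: toy_denoiser(v, target), list, 4, 4, rng)
    # Cosine similarity between the refined latent and the clean latent.
    print(sum(a * b for a, b in zip(z, target)))
```

In this toy setup each step halves the angle between the latent and the target, which is only a property of the stand-in denoiser; it says nothing about how quickly the real model converges.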

Why it matters:
The approach keeps the simplicity of Sphere Encoder while removing its main bottleneck: repeated encoder-decoder transitions. This leads to substantially lower computational cost and better sample quality in the low-step regime.
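
To make the bottleneck concrete, the sketch below counts network calls under a simplified cost model. It assumes, purely for illustration, that a pixel round-trip sampler performs one decode and one encode per step, while a latent-only sampler performs one denoiser call per step and a single decode at the end. The function names are hypothetical, and the real per-step call pattern and per-call cost of either method may differ.

```python
def pixel_roundtrip_calls(num_steps):
    """Call counts if every step decodes to pixels and re-encodes (assumed)."""
    return {"encoder": num_steps, "decoder": num_steps, "denoiser": 0}


def latent_only_calls(num_steps):
    """Call counts if all steps stay in latent space and decoding happens once."""
    return {"encoder": 0, "decoder": 1, "denoiser": num_steps}


if __name__ == "__main__":
    for steps in (1, 4, 6):
        print(steps, pixel_roundtrip_calls(steps), latent_only_calls(steps))
```

Under this model the number of decoder calls stays constant as the step count grows, so the saving depends on how the cost of one denoiser call compares with one encode plus one decode.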

Highlights:

- Generates high-quality 256×256 images in only a few sampling steps.
- Reduces inference cost by avoiding repeated pixel-latent conversions.
- Improves over Sphere Encoder on Animal-Faces, Oxford-Flowers, and ImageNet-1K.
- Achieves strong ImageNet-1K results, improving FID from 4.02 to 2.25 at the same 4-step CFG setting, and to 2.11 with 6 steps.
- Ablations show that spherical projection, consistency loss, noise distribution, and the choice of representation autoencoder are all important for performance.
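
For a sense of scale, the ImageNet-1K FID figures quoted above can be turned into relative reductions (lower FID is better). The helper below is only arithmetic on the reported numbers; `relative_reduction` is a hypothetical name, and the 6-step figure is not a like-for-like comparison with the 4-step baseline.

```python
def relative_reduction(baseline, improved):
    """Fractional drop from baseline to improved; positive means improved is lower."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # 4-step CFG setting: 4.02 -> 2.25
    print(f"{relative_reduction(4.02, 2.25):.1%}")  # prints 44.0%
    # 6 steps: 4.02 -> 2.11
    print(f"{relative_reduction(4.02, 2.11):.1%}")  # prints 47.5%
```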

A particularly interesting takeaway is that strong semantic latent representations plus spherical latent modeling can provide a practical alternative to standard diffusion/flow sampling, especially when low-NFE generation is the priority.

Limitations are also clear: the current experiments focus on class-conditional generation, rely on a strong pretrained representation autoencoder, and still find high-quality one-step generation challenging. Overall, this is a promising direction for efficient latent-space generative modeling.
