Hugging Face Daily Papers · 5 min read

MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.16673

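Published on Jun 15, 2026 · Submitted by Orest Kupyn on Jun 16, 2026
Authors: Yagmur Akarken, Orest Kupyn, Christian Rupprecht (University of Oxford)
Project page: https://yagmurakarken.github.io/mmdiff/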

Abstract

MMDiff transforms frozen diffusion transformers into multi-modal generative systems that produce images and perceptual modalities using lightweight decoders, achieving improved semantic segmentation through multi-timestep feature fusion and spatial aggregation.

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
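
As a rough illustration of what "multi-timestep feature fusion with spatially varying aggregation weights" could look like, the sketch below fuses feature maps taken at several denoising timesteps using a separate set of weights at every pixel. It is not the authors' implementation: the abstract does not say how the weights are produced, so the per-pixel `logits` input, the softmax normalization over timesteps, and the names `softmax` and `fuse_timesteps` are all assumptions made for this example. Plain Python lists are used to keep it dependency-free.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_timesteps(features, logits):
    """Fuse per-timestep feature maps with per-pixel weights.

    features[t][c][y][x]: feature maps extracted at T timesteps (hypothetical layout).
    logits[t][y][x]:      unnormalized per-pixel scores for each timestep
                          (assumed to come from some small learned head).
    Returns fused[c][y][x], a weighted sum over timesteps where the
    weights at each pixel are softmax(logits[:, y, x]).
    """
    T = len(features)
    C = len(features[0])
    H = len(features[0][0])
    W = len(features[0][0][0])
    fused = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for y in range(H):
        for x in range(W):
            # Weights differ from pixel to pixel: this is the
            # "spatially varying" part of the aggregation.
            w = softmax([logits[t][y][x] for t in range(T)])
            for c in range(C):
                fused[c][y][x] = sum(
                    w[t] * features[t][c][y][x] for t in range(T)
                )
    return fused


# Two timesteps, one channel, a 1x2 feature map.
features = [
    [[[1.0, 1.0]]],  # timestep 0
    [[[3.0, 3.0]]],  # timestep 1
]
logits = [
    [[0.0, 10.0]],   # timestep 0 scores per pixel
    [[0.0, -10.0]],  # timestep 1 scores per pixel
]
fused = fuse_timesteps(features, logits)
# Pixel (0, 0) averages both timesteps equally; pixel (0, 1) relies
# almost entirely on timestep 0.
print(fused)
```

A single global weight per timestep would be the special case where `logits[t][y][x]` is the same at every pixel; letting the scores vary spatially allows different image regions to draw on different parts of the denoising trajectory.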


Get this paper in your agent:

hf papers read 2606.16673
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
