Hugging Face Daily Papers · · 6 min read

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face prominent challenges: (1) the inherent learning conflicts between visual understanding and generation tasks, leading to suboptimal modeling in both tasks; (2) different understanding and generation visual spaces impeding scalability; (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which leverages a Noisy ViT encoder along with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT is able to leverage the latent space as a unified visual representation, enabling seamless compatibility between understanding and generation tasks. Thus, the scalability within the generation tasks and the semantic expressiveness within understanding tasks can be balanced. Also, we construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT achieves effective unification of multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation tasks, our UniDDT achieves 0.87 GenEval score and 86.9 DPG overall score. For multimodal understanding tasks, our UniDDT achieves 1699.5 score on MME benchmark and 76.5 overall score on SEEDbench.</p>\n","updatedAt":"2026-06-16T02:34:24.001Z","author":{"_id":"66615c855fd9d736e670e0a9","avatarUrl":"/avatars/0ff3127b513552432a7c651e21d7f283.svg","fullname":"wangshuai","name":"wangsssssss","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"followerCount":24,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8678892254829407},"editors":["wangsssssss"],"editorAvatarUrls":["/avatars/0ff3127b513552432a7c651e21d7f283.svg"],"reactions":[{"reaction":"🤯","users":["wangsssssss"],"count":1},{"reaction":"🔥","users":["zgzaacm"],"count":1},{"reaction":"🚀","users":["zgzaacm"],"count":1}],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2606.16255","authors":[{"_id":"6a30b5b2a0d4daae4285fd60","user":{"_id":"66615c855fd9d736e670e0a9","avatarUrl":"/avatars/0ff3127b513552432a7c651e21d7f283.svg","isPro":false,"fullname":"wangshuai","user":"wangsssssss","type":"user","name":"wangsssssss"},"name":"Shuai Wang","status":"claimed_verified","statusLastChangedAt":"2026-06-16T12:07:20.485Z","hidden":false},{"_id":"6a30b5b2a0d4daae4285fd61","name":"Liang Li","hidden":false},{"_id":"6a30b5b2a0d4daae4285fd62","name":"Yang Chen","hidden":false},{"_id":"6a30b5b2a0d4daae4285fd63","name":"Ruopeng Gao","hidden":false},{"_id":"6a30b5b2a0d4daae4285fd64","name":"Yao Teng","hidden":false},{"_id":"6a30b5b2a0d4daae4285fd65","name":"Limin Wang","hidden":false}],"publishedAt":"2026-06-15T00:00:00.000Z","submittedOnDailyAt":"2026-06-16T00:00:00.000Z","title":"UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer","submittedOnDailyBy":{"_id":"66615c855fd9d736e670e0a9","avatarUrl":"/avatars/0ff3127b513552432a7c651e21d7f283.svg","isPro":false,"fullname":"wangshuai","user":"wangsssssss","type":"user","name":"wangsssssss"},"summary":"Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face prominent challenges: (1) the inherent learning conflicts between visual understanding and generation tasks, leading to suboptimal modeling in both tasks; (2) different understanding and generation visual spaces impeding scalability; (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which leverages a Noisy ViT encoder along with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT is able to leverage the latent space as a unified visual representation, enabling seamless compatibility between understanding and generation tasks. Thus, the scalability within the generation tasks and the semantic expressiveness within understanding tasks can be balanced. Also, we construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT achieves effective unification of multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation tasks, our UniDDT achieves 0.87 GenEval score and 86.9 DPG overall score. For multimodal understanding tasks, our UniDDT achieves 1699.5 score on MME benchmark and 76.5 overall score on SEEDbench.","upvotes":8,"discussionId":"6a30b5b2a0d4daae4285fd66","projectPage":"https://huggingface.co/papers/2606.16255","ai_summary":"UniDDT addresses key challenges in unified multimodal models by leveraging a Noisy ViT encoder and LLM for semantic encoding while using separate diffusion decoders to balance visual understanding and generation tasks.","ai_keywords":["Unified Multimodal Models","Noisy ViT encoder","LLM","semantic encoding","diffusion decoder","latent space","visual generation","multimodal understanding","dual data structures","GenEval score","DPG score","MME benchmark","SEEDbench"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","organization":{"_id":"638f70e8f1256a80d4288555","name":"nanjinguniv","fullname":"Nanjing University","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/638f706ef1256a80d42880f9/6M6-JzwJGiLxjIJzvCflf.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"66615c855fd9d736e670e0a9","avatarUrl":"/avatars/0ff3127b513552432a7c651e21d7f283.svg","isPro":false,"fullname":"wangshuai","user":"wangsssssss","type":"user"},{"_id":"66320391060ab1f666819e9c","avatarUrl":"/avatars/42856c3066f49ff183b10b5239daa6ab.svg","isPro":false,"fullname":"Ruopeng Gao","user":"HELLORPG","type":"user"},{"_id":"60c8d264224e250fb0178f77","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/60c8d264224e250fb0178f77/i8fbkBVcoFeJRmkQ9kYAE.png","isPro":false,"fullname":"Adam Lee","user":"Abecid","type":"user"},{"_id":"651f8133dbf879b8c58f5136","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/651f8133dbf879b8c58f5136/0L8Ecgi5Ietkm_DchJwE-.png","isPro":false,"fullname":"Zikai Zhou","user":"Klayand","type":"user"},{"_id":"6684152a443492c24cdac044","avatarUrl":"/avatars/1d5abbde12a808aa743769603e494ddb.svg","isPro":false,"fullname":"Guozhen Zhang","user":"zgzaacm","type":"user"},{"_id":"63ea23b9dedfeebe54d02bdf","avatarUrl":"/avatars/4d9f9a546aa8c63e277161ea700075c4.svg","isPro":false,"fullname":"Yuqing Wang","user":"Epiphqny","type":"user"},{"_id":"65d851096769b3a9c9376134","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d851096769b3a9c9376134/j_d2RSa-3rRmSAmRICSk0.jpeg","isPro":false,"fullname":"ZehongMa","user":"zehongma","type":"user"},{"_id":"6270324ebecab9e2dcf245de","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/6270324ebecab9e2dcf245de/cMbtWSasyNlYc9hvsEEzt.jpeg","isPro":false,"fullname":"Kye Gomez","user":"kye","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"638f70e8f1256a80d4288555","name":"nanjinguniv","fullname":"Nanjing University","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/638f706ef1256a80d42880f9/6M6-JzwJGiLxjIJzvCflf.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2606/2606.16255.md","query":{}}">
Papers
arxiv:2606.16255

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

Published on Jun 15
· Submitted by
wangshuai
on Jun 16
Authors:
,
,
,
,

Abstract

UniDDT addresses key challenges in unified multimodal models by leveraging a Noisy ViT encoder and LLM for semantic encoding while using separate diffusion decoders to balance visual understanding and generation tasks.

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face prominent challenges: (1) the inherent learning conflicts between visual understanding and generation tasks, leading to suboptimal modeling in both tasks; (2) different understanding and generation visual spaces impeding scalability; (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which leverages a Noisy ViT encoder along with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT is able to leverage the latent space as a unified visual representation, enabling seamless compatibility between understanding and generation tasks. Thus, the scalability within the generation tasks and the semantic expressiveness within understanding tasks can be balanced. Also, we construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT achieves effective unification of multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation tasks, our UniDDT achieves 0.87 GenEval score and 86.9 DPG overall score. For multimodal understanding tasks, our UniDDT achieves 1699.5 score on MME benchmark and 76.5 overall score on SEEDbench.

Community

Paper author Paper submitter about 10 hours ago

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face prominent challenges: (1) the inherent learning conflicts between visual understanding and generation tasks, leading to suboptimal modeling in both tasks; (2) different understanding and generation visual spaces impeding scalability; (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which leverages a Noisy ViT encoder along with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT is able to leverage the latent space as a unified visual representation, enabling seamless compatibility between understanding and generation tasks. Thus, the scalability within the generation tasks and the semantic expressiveness within understanding tasks can be balanced. Also, we construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT achieves effective unification of multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation tasks, our UniDDT achieves 0.87 GenEval score and 86.9 DPG overall score. For multimodal understanding tasks, our UniDDT achieves 1699.5 score on MME benchmark and 76.5 overall score on SEEDbench.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.
Tap or paste here to upload images

· Sign up or log in to comment

Get this paper in your agent:

hf papers read 2606.16255
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.16255 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.16255 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.16255 in a Space README.md to link it from this page.

Collections including this paper 2

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from Hugging Face Daily Papers