LatentUMM: Dual Latent Alignment for Unified Multimodal Models
Authors: Yinyi Luo, Wenwen Wang, Hayes Bai, Marios Savvides, Jindong Wang
Organization: Carnegie Mellon University
Published: 2026-05-18 (arXiv:2605.17766); submitted to Daily Papers on 2026-05-25 by Jindong Wang
AI-generated summary: LatentUMM addresses multimodal consistency issues by constructing an enhanced shared latent space that explicitly aligns transformations between modalities and stabilizes latent dynamics during generation and re-encoding.
Abstract
Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.
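To make the two stages in the abstract more concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation and does not use the TorchUMM code. Everything named here is a hypothetical stand-in chosen for illustration: `toy_generate` and `toy_encode` play the role of the transformations out of and into the latent space, the round-trip cosine loss is one plausible way to express "bidirectional consistency under generation and re-encoding", and the pairwise loss is the standard DPO objective, used here only as an assumed example of preference optimization over stochastic latent rollouts.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


# Hypothetical stand-ins for a UMM's two capabilities (assumptions, not from the paper).
def toy_generate(z):
    """Map a latent vector to an 'output' (here: a simple affine map)."""
    return [2.0 * x + 0.1 for x in z]


def toy_encode(y):
    """Map an 'output' back to a latent; deliberately an imperfect inverse."""
    return [0.5 * x for x in y]


def cycle_consistency_loss(z):
    """1 - cos(z, encode(generate(z))); near zero when the round trip keeps z's direction."""
    z_back = toy_encode(toy_generate(z))
    return 1.0 - cosine(z, z_back)


def rollout_preference_pair(z, num_rollouts=8, noise_std=0.3, seed=0):
    """Sample noisy copies of z, rank them by round-trip consistency,
    and return (preferred, rejected) = (most consistent, least consistent)."""
    rng = random.Random(seed)
    rollouts = [
        [x + rng.gauss(0.0, noise_std) for x in z] for _ in range(num_rollouts)
    ]
    ranked = sorted(rollouts, key=cycle_consistency_loss)
    return ranked[0], ranked[-1]


def dpo_style_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin)."""
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    return math.log1p(math.exp(-beta * margin))


if __name__ == "__main__":
    z = [1.0, -0.5, 0.2]
    print("round-trip loss:", cycle_consistency_loss(z))
    preferred, rejected = rollout_preference_pair(z)
    print("preferred rollout loss:", cycle_consistency_loss(preferred))
    print("rejected rollout loss:", cycle_consistency_loss(rejected))
```

In this toy setup the semantic drift comes from the constant offset in `toy_generate`, which the encoder does not undo. A real system would replace the toy maps with the model's generation and understanding paths and would obtain the log-probabilities passed to `dpo_style_loss` from the policy and a frozen reference model.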