Hugging Face Daily Papers · 3 min read

LatentUMM: Dual Latent Alignment for Unified Multimodal Models

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.17766

Published on May 18, 2026 · Submitted by Jindong Wang on May 25, 2026
Organization: Carnegie Mellon University
Authors: Yinyi Luo, Wenwen Wang, Hayes Bai, Marios Savvides, Jindong Wang

AI-generated summary

LatentUMM addresses multimodal consistency issues by constructing an enhanced shared latent space that explicitly aligns transformations between modalities and stabilizes latent dynamics during generation and re-encoding processes.

Abstract

Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.
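
To make "semantic drift under generation and re-encoding" more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation and does not use the linked repository. Everything in it is an assumption made for illustration: the hypothetical `generate` and `reencode` functions are small fixed linear maps standing in for a model's decoder and encoder, drift is measured as cosine distance between a latent and its re-encoded round trip, and stochastic rollouts are simulated by adding Gaussian noise to the latent. The last function ranks rollouts by drift and returns a (chosen, rejected) pair of the kind a preference-optimization step could consume.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def generate(z):
    # Hypothetical decoder (assumption): maps a 2-D latent to a toy output.
    return [2.0 * z[0] + 0.5 * z[1], -0.5 * z[0] + 2.0 * z[1]]


def reencode(x):
    # Hypothetical encoder (assumption): a deliberately imperfect inverse
    # of `generate`, so the round trip does not return exactly to z.
    return [0.5 * x[0], 0.4 * x[1]]


def cycle_drift(z):
    """Drift of a latent after one generate -> re-encode round trip."""
    return cosine_distance(z, reencode(generate(z)))


def rollout_preference_pair(z, num_rollouts=8, noise_std=0.3, seed=0):
    """Sample noisy latents around z and rank them by round-trip drift.

    Returns (chosen, rejected): the rollout with the lowest drift and the
    rollout with the highest drift.
    """
    rng = random.Random(seed)
    rollouts = [
        [zi + rng.gauss(0.0, noise_std) for zi in z]
        for _ in range(num_rollouts)
    ]
    ranked = sorted(rollouts, key=cycle_drift)
    return ranked[0], ranked[-1]


if __name__ == "__main__":
    z = [1.0, 0.2]
    chosen, rejected = rollout_preference_pair(z)
    print("drift of z:       ", cycle_drift(z))
    print("drift of chosen:  ", cycle_drift(chosen))
    print("drift of rejected:", cycle_drift(rejected))
```

In a real unified multimodal model, the decoder and encoder would be learned networks and the drift measure would likely come from a separate embedding model. The sketch only shows the shape of the loop: round trip, score, rank, and form a preference pair.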

Community

Paper submitter about 2 hours ago

Latent alignment for UMMs

Get this paper in your agent:

hf papers read 2605.17766
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
