Hugging Face Daily Papers · May 22, 2026 · 3 min read

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Paper link: https://arxiv.org/abs/2605.22012

Code: https://github.com/yfanDai/LatentOmni

arxiv:2605.22012

Published on May 21 · Submitted by bohan zeng on May 22 · Kling Team

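Authors: Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang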

AI-generated summary

LatentOmni is a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. It uses feature-level supervision and Omni-Sync Position Embedding (OSPE) for temporal consistency, and it outperforms explicit text-based chain-of-thought approaches on audio-visual reasoning tasks.

Abstract

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
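The two mechanisms named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how OSPE assigns positions or which loss the feature-level supervision uses. The sketch assumes that temporal consistency means audio and visual tokens occurring at the same time share a position index on a common tick grid, and that alignment is measured with a cosine distance. The names `sync_position_ids`, `cosine_alignment_loss`, and `ticks_per_second` are hypothetical.

```python
import math


def sync_position_ids(video_times, audio_times, ticks_per_second=10):
    """Map token timestamps (seconds) to integer positions on a shared time grid.

    Assumption: tokens from either modality that occur at the same time
    receive the same temporal position id.
    """
    def to_id(t):
        return int(round(t * ticks_per_second))

    return [to_id(t) for t in video_times], [to_id(t) for t in audio_times]


def cosine_alignment_loss(latents, targets, eps=1e-8):
    """Mean (1 - cosine similarity) between latent states and target features.

    Assumption: a cosine distance stands in for the feature-level
    supervision signal; the paper may use a different objective.
    """
    total = 0.0
    for z, f in zip(latents, targets):
        dot = sum(a * b for a, b in zip(z, f))
        norm = math.sqrt(sum(a * a for a in z)) * math.sqrt(sum(b * b for b in f))
        total += 1.0 - dot / (norm + eps)  # eps guards against zero vectors
    return total / len(latents)


if __name__ == "__main__":
    # Video sampled at 2 frames per second, audio chunks at irregular times.
    video_ids, audio_ids = sync_position_ids([0.0, 0.5, 1.0], [0.0, 0.1, 0.5])
    print(video_ids, audio_ids)

    # Latent states that point in the same direction as their targets.
    print(cosine_alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]]))
```

In this toy setup the video frame at 0.5 s and the audio chunk at 0.5 s land on the same position id, which is one simple way to keep the two latent streams time-aligned. The loss is close to zero when each latent state points in the same direction as its target feature and grows as they diverge.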