LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
Authors: Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang
Organization: Kling Team
Published: 2026-05-21
Paper link: https://arxiv.org/abs/2605.22012
Code: https://github.com/yfanDai/LatentOmni
AI-generated summary
LatentOmni is a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states, using feature-level supervision and Omni-Sync Position Embedding (OSPE) for temporal consistency. It outperforms an explicit text-based chain-of-thought baseline on audio-visual reasoning benchmarks.
Abstract
Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
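The abstract names two mechanisms, feature-level supervision and a synchronized position embedding, without giving their exact form. The sketch below is one plausible reading, not the authors' implementation: it assumes the alignment objective is a mean cosine distance between latent reasoning states and target sensory features, and that temporal synchronization can be approximated by giving audio and visual states that fall in the same time slot the same position id. The function names `cosine_alignment_loss` and `synced_position_ids` and the `step_s` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_alignment_loss(
    latents: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
) -> float:
    """Mean (1 - cosine similarity) over paired vectors.

    Assumption: this stands in for a feature-level supervision term that
    pulls each latent reasoning state toward a task-relevant sensory feature.
    """
    if len(latents) != len(targets) or not latents:
        raise ValueError("need equally sized, non-empty sequences")
    total = 0.0
    for z, f in zip(latents, targets):
        dot = sum(a * b for a, b in zip(z, f))
        norm = math.sqrt(sum(a * a for a in z)) * math.sqrt(sum(b * b for b in f))
        # Guard against zero-length vectors.
        total += 1.0 - dot / max(norm, 1e-12)
    return total / len(latents)


def synced_position_ids(timestamps_s: Sequence[float], step_s: float = 0.5) -> List[int]:
    """Map timestamps (seconds) to integer time-slot ids.

    Assumption: audio and visual latent states that occur in the same slot
    receive the same id, which is one simple way to keep the two streams
    temporally consistent.
    """
    if step_s <= 0:
        raise ValueError("step_s must be positive")
    return [int(t // step_s) for t in timestamps_s]


if __name__ == "__main__":
    # Toy latent states and target features (2-dimensional for readability).
    latents = [[1.0, 0.0], [0.0, 1.0]]
    targets = [[2.0, 0.0], [1.0, 0.0]]
    print("alignment loss:", cosine_alignment_loss(latents, targets))

    # Visual frames and audio chunks sampled at different times.
    print("visual ids:", synced_position_ids([0.0, 0.5, 1.0, 1.25]))
    print("audio ids:", synced_position_ids([0.1, 0.6]))
```

In a training setup along these lines, the alignment term would be added to the usual next-token loss on the textual reasoning tokens; how the two are weighted is not stated in the abstract.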