Native Audio-Visual Alignment for Generation
Abstract
NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising.
AI-generated summary
Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.
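The align-then-fuse ordering described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the single-head attention, the weight names (`w_video`, `w_audio`, `w_shared`), and all shapes are assumptions made for illustration. It only shows the ordering, in which audio and video tokens first attend to each other without any context, and the context is applied afterwards through a shared projection.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention (no learned q/k/v maps, for brevity).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def align_stage(video, audio, w_video, w_audio):
    # Assumed "modality-aware" step: each modality has its own projection,
    # then audio and video tokens attend to each other. No context is used here.
    joint = np.concatenate([video @ w_video, audio @ w_audio], axis=0)
    joint = joint + attention(joint, joint, joint)
    return joint[: len(video)], joint[len(video):]


def fuse_stage(video, audio, context, w_shared):
    # Assumed "modality-shared" step: one projection for both modalities,
    # then cross-attention from the joint tokens to the external context.
    joint = np.concatenate([video, audio], axis=0) @ w_shared
    joint = joint + attention(joint, context, context)
    return joint[: len(video)], joint[len(video):]


def align_then_fuse(video, audio, context, params):
    # Align audio and video first, then condition the aligned tokens on context.
    video, audio = align_stage(video, audio, params["w_video"], params["w_audio"])
    return fuse_stage(video, audio, context, params["w_shared"])


# Hypothetical toy sizes: 8 video tokens, 5 audio tokens, 4 context tokens, width 16.
rng = np.random.default_rng(0)
d = 16
params = {
    name: rng.normal(scale=d ** -0.5, size=(d, d))
    for name in ("w_video", "w_audio", "w_shared")
}
video = rng.normal(size=(8, d))
audio = rng.normal(size=(5, d))
context = rng.normal(size=(4, d))

video_out, audio_out = align_then_fuse(video, audio, context, params)
print(video_out.shape, audio_out.shape)
```

In this sketch `align_stage` takes no context argument at all, which is one simple way to keep low-level audio-video correspondence separate from semantic conditioning. How NAVA actually stacks, interleaves, or parameterizes these stages is not specified in the abstract.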
Community
NAVA is a 6.3B-parameter Native Audio-Visual Alignment framework for joint audio-video generation. To overcome the limitations of existing dual-tower and unified paradigms, NAVA employs an Align-then-Fuse MMDiT architecture that first establishes fine-grained audio-video correspondence before applying textual context. Furthermore, it introduces Timbre-in-Context Conditioning for highly controllable speech generation. Experiments show NAVA achieves superior A-V synchronization, robust video quality, and enhanced reference-timbre controllability on Verse-Bench and Seed-TTS.
arXiv: arxiv.org/abs/2605.30073