Hugging Face Daily Papers · May 29, 2026 · 6 min read

Native Audio-Visual Alignment for Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

NAVA is a 6.3B-parameter Native Audio-Visual Alignment framework for joint audio-video generation. To overcome the limitations of existing dual-tower and unified paradigms, NAVA employs an Align-then-Fuse MMDiT architecture that first establishes fine-grained audio-video correspondence before applying textual context. Furthermore, it introduces Timbre-in-Context Conditioning for highly controllable speech generation. Experiments show NAVA achieves superior A-V synchronization, robust video quality, and enhanced reference-timbre controllability on Verse-Bench and Seed-TTS.

Project page: https://ernie-research.github.io/NAVA/

Code: https://github.com/ernie-research/NAVA

arxiv:2605.30073

Published on May 28 · Submitted by Longbin Ji on May 29 · BAIDU

Authors: Longbin Ji, Guan Wang, Xuan Wei, Chenye Yang, Xiangrui Liu, Zhenyu Zhang, Shuohuan Wang, Yu Sun, Jingzhou He

Abstract

AI-generated summary: NAVA enables joint audio-video generation with improved synchronization and controllability through native audio-visual alignment and context-conditioned denoising.

Joint audio-video generation aims to synthesize temporally synchronized and semantically coherent visual-acoustic content. However, existing open-source methods mainly rely on either dual-tower designs with posterior alignment or fully unified tri-modal designs that mix textual context, audio and video in one shared space. The former weakens fine-grained audio-video co-evolution, while the latter couples semantic conditioning with low-level synchronization. To address these limitations, we propose NAVA, a Native Audio-Visual Alignment framework for joint audio-video generation. NAVA is built upon context-conditioned native audio-visual alignment: it first establishes audio-video correspondence in a dedicated interaction space, and then uses external context to condition the joint denoising process. Specifically, NAVA is instantiated with an Align-then-Fuse MMDiT architecture, which transitions from modality-aware audio-video alignment to modality-shared joint denoising. Furthermore, we introduce Timbre-in-Context Conditioning to associate reference timbre cues with corresponding speech spans to achieve controllable speech timbre. Experiments on Verse-Bench and Seed-TTS, together with a user study, demonstrate that NAVA achieves superior video quality, precise audio-visual synchronization, competitive audio quality, and stronger reference-timbre controllability using only 6.3B parameters.
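The align-then-fuse ordering described in the abstract can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the layer counts, the hidden width `D`, the random stand-in weights, and the choice of cross-attention for injecting text context are all assumptions made for illustration, and every function name here is hypothetical. The sketch only shows the data flow: audio and video tokens first attend to each other with modality-specific projections and no text, then a shared set of weights processes the joint sequence while text conditions it.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hypothetical shared hidden width (assumption, not from the paper)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def make_proj():
    """Random (q, k, v) matrices standing in for learned weights."""
    return [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3)]


def align_block(audio, video, w_audio, w_video):
    """Stage 1 (modality-aware): separate projections per modality, but
    attention runs over the concatenated audio+video sequence. No text here."""
    qa, ka, va = (audio @ w for w in w_audio)
    qv, kv, vv = (video @ w for w in w_video)
    q = np.concatenate([qa, qv])
    k = np.concatenate([ka, kv])
    v = np.concatenate([va, vv])
    out = attention(q, k, v)
    n_a = audio.shape[0]
    return audio + out[:n_a], video + out[n_a:]  # residual connections


def fuse_block(av, text, w_self, w_cross):
    """Stage 2 (modality-shared): one weight set for all audio-video tokens.
    Text enters via cross-attention, which is an assumption of this sketch."""
    q, k, v = (av @ w for w in w_self)
    av = av + attention(q, k, v)
    return av + attention(av @ w_cross[0], text @ w_cross[1], text @ w_cross[2])


def align_then_fuse(audio, video, text, n_align=2, n_fuse=2):
    """Run the alignment blocks first, then the context-conditioned shared blocks."""
    for _ in range(n_align):
        audio, video = align_block(audio, video, make_proj(), make_proj())
    n_a = audio.shape[0]
    av = np.concatenate([audio, video])
    for _ in range(n_fuse):
        av = fuse_block(av, text, make_proj(), make_proj())
    return av[:n_a], av[n_a:]


if __name__ == "__main__":
    audio = rng.normal(size=(5, D))  # 5 audio tokens
    video = rng.normal(size=(7, D))  # 7 video tokens
    text = rng.normal(size=(3, D))   # 3 text-context tokens
    a_out, v_out = align_then_fuse(audio, video, text)
    print(a_out.shape, v_out.shape)
```

The point of the ordering is that text tokens never enter the first stage, so audio-video correspondence is formed without semantic conditioning mixed in, which is the coupling the abstract attributes to fully unified tri-modal designs.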