LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
Authors: Tengfei Liu, Yang Shi, Xuanyu Zhu, Jiafu Tang, Liu Yang, Qixun Wang, Zhuoran Zhang, Yuqi Tang, Fengxiang Wang, Yuhao Dong, Xinlong Chen, Bozhou Li, Bohan Zeng, Yue Ding, Xiaohan Zhang, Jialu Chen, Haotian Wang, Yuanxing Zhang, Pengfei Wan, Leye Wang
Organization: Kling Team
Published: 2026-05-25
Code: https://github.com/pkucs-Ltf/LongAV-Compass
AI-generated summary
LongAV-Compass is a comprehensive benchmark for evaluating minute-long audio-visual generation across multiple modalities, assessing quality, consistency, and alignment over extended temporal sequences.
Abstract
Audio-visual generation is rapidly advancing from short clips to minute-long content, yet evaluation protocols remain largely confined to short-form settings. Existing benchmarks focus primarily on 5-10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV) generation, organized by application scenario and generation complexity. It combines taxonomy-guided construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models, together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing where current systems fail to sustain coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.
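The abstract does not give formulas for its metrics, so the following is only an illustrative sketch of how a cross-segment consistency score might be computed from per-segment feature vectors (for example, embeddings produced by an image encoder such as DINO-v2 or a face encoder such as ArcFace). The function names `cosine`, `similarity_to_first`, and `mean_adjacent_similarity`, the use of cosine similarity, and the toy vectors are all assumptions made for illustration, not the benchmark's actual implementation.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_to_first(embeddings: List[Sequence[float]]) -> List[float]:
    """Similarity of every later segment to the first one.

    A falling curve would suggest drift (e.g. identity or scene change)
    as the generated video gets longer. Hypothetical helper.
    """
    reference = embeddings[0]
    return [cosine(reference, e) for e in embeddings[1:]]


def mean_adjacent_similarity(embeddings: List[Sequence[float]]) -> float:
    """Average similarity between consecutive segments (1.0 for a single segment)."""
    sims = [cosine(a, b) for a, b in zip(embeddings, embeddings[1:])]
    return sum(sims) / len(sims) if sims else 1.0


if __name__ == "__main__":
    # Toy 2-D "embeddings" for three segments; real encoders output
    # vectors with hundreds of dimensions.
    segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(similarity_to_first(segments))
    print(mean_adjacent_similarity(segments))
```

Comparing each segment against the first one exposes gradual drift over a minute-long output, while the adjacent-pair average is more sensitive to abrupt breaks between neighbouring segments; a real evaluation would likely need both views, plus a choice of how segments are sampled.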
Cite arxiv.org/abs/2605.26244 in a model README.md to link it from this page.