OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
Authors: Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao, Jing LYU, Xirong Li
Published: 2026-05-18 (arXiv:2605.18577)
Project page: https://ruixiangzhao.github.io/OmniPro/
Code: https://github.com/RuixiangZhao/OmniPro
Abstract
AI-generated summary: OmniPro is introduced as the first benchmark for proactive streaming video understanding in omni-modal large language models, featuring diverse tasks and a dual-mode evaluation protocol.
Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.
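To make the dual-mode protocol more concrete, the following is a minimal sketch of how the two modes could be set up. It is not the authors' implementation: the function names `probe_times` and `match_online`, the probe offset `delta`, the tolerance `window`, and the greedy matching rule are all assumptions made for illustration, since the abstract does not specify the scoring details.

```python
from typing import List, Tuple


def probe_times(triggers: List[float], delta: float = 1.0) -> List[Tuple[float, float]]:
    """Probe mode (hypothetical): for each ground-truth trigger time,
    return a (before, after) pair of timestamps at which to query the model.
    `delta` is an assumed offset in seconds, clamped so 'before' is never negative."""
    return [(max(0.0, t - delta), t + delta) for t in triggers]


def match_online(
    triggers: List[float], responses: List[float], window: float = 2.0
) -> Tuple[int, int, int]:
    """Online mode (hypothetical): score the timing of autonomous responses.
    A response counts as a hit if it lands in [t, t + window] for some
    not-yet-matched trigger t. Each trigger can be matched at most once.
    Returns (hits, missed_triggers, spurious_responses)."""
    unmatched = sorted(triggers)
    hits = 0
    spurious = 0
    for r in sorted(responses):
        # Earliest unmatched trigger whose window contains this response, if any.
        match = next((t for t in unmatched if t <= r <= t + window), None)
        if match is None:
            spurious += 1
        else:
            unmatched.remove(match)
            hits += 1
    return hits, len(unmatched), spurious


if __name__ == "__main__":
    triggers = [5.0, 12.0, 20.0]
    print(probe_times(triggers))                      # query points around each trigger
    print(match_online(triggers, [5.5, 9.0, 21.0]))   # timing score for one model run
```

The sketch separates the two questions the protocol asks: Probe mode fixes when the model is queried and checks only what it says, while Online mode leaves the timing to the model, so missed triggers and spurious responses have to be counted as well. Content correctness of each response would be judged separately and is left out here.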