Hugging Face Daily Papers · 5 min read

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.18577

Published on May 18, 2026 · Submitted by Zijie Xin on May 22, 2026
Authors: Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao, Jing LYU, Xirong Li
Project page: https://ruixiangzhao.github.io/OmniPro/
Code: https://github.com/RuixiangZhao/OmniPro

Abstract

AI-generated summary: OmniPro is introduced as the first benchmark for evaluating proactive streaming video understanding in omni-modal large language models, featuring diverse tasks and a dual-mode evaluation protocol.

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.
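The dual-mode protocol described above can be made concrete with a small sketch. The abstract does not specify the scoring rules, so everything below is an illustrative assumption rather than the OmniPro implementation: the `Sample` fields, the `margin` and `tolerance` parameters, exact string matching, and the choice of F1 for Online mode are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    """Hypothetical sample layout (not the actual OmniPro schema)."""
    trigger_times: List[float]  # ground-truth moments, in seconds, when a response is due
    answers: List[str]          # expected answer at each trigger


# Hypothetical probe interface: (sample, query_time) -> answer text,
# or "" when the model judges that nothing can be answered yet.
ProbeFn = Callable[[Sample, float], str]
Response = Tuple[float, str]  # (time the model chose to speak, what it said)


def probe_accuracy(sample: Sample, ask: ProbeFn, margin: float = 1.0) -> float:
    """Probe mode sketch: query shortly before and shortly after each trigger.

    Assumed rule: a trigger counts as correct only if the model stays
    silent before it and gives the expected answer after it.
    """
    if not sample.trigger_times:
        return 0.0
    correct = 0
    for t, expected in zip(sample.trigger_times, sample.answers):
        before = ask(sample, t - margin)
        after = ask(sample, t + margin)
        if before == "" and after == expected:
            correct += 1
    return correct / len(sample.trigger_times)


def online_f1(sample: Sample, responses: List[Response], tolerance: float = 2.0) -> float:
    """Online mode sketch: the model decides on its own when to respond.

    Assumed rule: a response is a hit if it arrives within `tolerance`
    seconds after a not-yet-matched trigger and its text matches.
    """
    matched = set()
    hits = 0
    for time, text in sorted(responses):
        for i, (t, expected) in enumerate(zip(sample.trigger_times, sample.answers)):
            if i not in matched and 0.0 <= time - t <= tolerance and text == expected:
                matched.add(i)
                hits += 1
                break
    precision = hits / len(responses) if responses else 0.0
    recall = hits / len(sample.trigger_times) if sample.trigger_times else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In this sketch, Probe mode isolates content understanding because the evaluator supplies the timing, while Online mode also penalizes mistimed or spurious responses through precision. A real evaluation would likely replace exact string matching with a more tolerant answer judge.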


Get this paper in your agent:

hf papers read 2605.18577
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

