Hugging Face Daily Papers · May 22, 2026 · 5 min read

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Authors: Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao, Jing LYU, Xirong Li

Project page: https://ruixiangzhao.github.io/OmniPro/

GitHub: https://github.com/RuixiangZhao/OmniPro


arxiv:2605.18577


Published on May 18 · Submitted by Zijie Xin on May 22



AI-generated summary

OmniPro is introduced as the first benchmark for evaluating proactive streaming video understanding in omni-modal large language models, featuring diverse tasks and a dual-mode evaluation protocol.

Abstract

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.
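To make the composition figures in the abstract concrete, here is a minimal Python sketch of how per-sample modality-isolation labels could be used to slice a benchmark by modality. The abstract does not describe the label format, so the `Sample` record, its field names, the three label values, and the toy records are all assumptions for illustration only; they are not OmniPro's actual schema or data.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label values; the abstract does not list the real label set.
VISUAL, SPEECH, NON_SPEECH = "visual", "speech", "non_speech_audio"


@dataclass(frozen=True)
class Sample:
    sample_id: str
    sub_task: str        # placeholder name; the 9 sub-task names are not given
    required: frozenset  # modalities needed to answer (assumed label format)


def needs_audio(sample: Sample) -> bool:
    """True if the sample cannot be answered from visual input alone."""
    return bool(sample.required & {SPEECH, NON_SPEECH})


def audio_fraction(samples) -> float:
    """Share of samples that require speech or non-speech audio."""
    samples = list(samples)
    return sum(needs_audio(s) for s in samples) / len(samples)


def count_by_modality(samples) -> Counter:
    """Number of samples in which each modality is required."""
    counts = Counter()
    for s in samples:
        counts.update(s.required)
    return counts


# Synthetic toy records, not benchmark data.
toy = [
    Sample("a", "task_1", frozenset({VISUAL})),
    Sample("b", "task_1", frozenset({VISUAL, SPEECH})),
    Sample("c", "task_2", frozenset({NON_SPEECH})),
    Sample("d", "task_2", frozenset({VISUAL, NON_SPEECH})),
]

print(audio_fraction(toy))   # 0.75 for the toy records
print(round(0.84 * 2700))    # 2268: what 84% of 2,700 samples amounts to
```

The last line is plain arithmetic on the figures quoted above: 84% of 2,700 samples is about 2,268 samples that need audio.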


Community

xxayt

Paper submitter about 3 hours ago

We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input.
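The two modes described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's scoring code: the `Trigger` record, the rule that a probe is passed only when the model stays silent before the trigger and answers correctly after it, the `margin_s` and `tolerance_s` parameters, and exact string matching of answers are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Trigger:
    time_s: float  # ground-truth moment at which a response becomes warranted
    answer: str    # reference answer


# Assumed model interface: (question, time_s) -> answer, or None for "not yet".
ProbeModel = Callable[[str, float], Optional[str]]


def probe_score(model: ProbeModel, question: str, trigger: Trigger,
                margin_s: float = 1.0) -> bool:
    """Probe mode (assumed rule): query just before and just after the trigger.

    Passes only if the model withholds an answer before the trigger and
    gives the reference answer after it.
    """
    before = model(question, trigger.time_s - margin_s)
    after = model(question, trigger.time_s + margin_s)
    return before is None and after == trigger.answer


def online_score(responses: List[Tuple[float, str]], triggers: List[Trigger],
                 tolerance_s: float = 2.0) -> Tuple[float, float]:
    """Online mode (assumed rule): the model chooses its own response times.

    Each trigger is matched to the earliest unused response that falls in
    [trigger time, trigger time + tolerance_s] and has the right answer.
    Returns (precision, recall) over responses and triggers.
    """
    unused = sorted(responses)
    hits = 0
    for trig in sorted(triggers, key=lambda t: t.time_s):
        for i, (time_s, text) in enumerate(unused):
            in_window = trig.time_s <= time_s <= trig.time_s + tolerance_s
            if in_window and text == trig.answer:
                hits += 1
                del unused[i]  # each response can satisfy one trigger only
                break
    precision = hits / len(responses) if responses else 0.0
    recall = hits / len(triggers) if triggers else 0.0
    return precision, recall


# Synthetic toy example, not benchmark data.
triggers = [Trigger(10.0, "door closes"), Trigger(25.0, "phone rings")]


def toy_model(question: str, time_s: float) -> Optional[str]:
    # Stays silent until t = 10 s, then always reports the first event.
    return None if time_s < 10.0 else "door closes"


toy_responses = [(10.5, "door closes"), (18.0, "phone rings"), (26.0, "phone rings")]

print(probe_score(toy_model, "What just happened?", triggers[0]))  # True
print(probe_score(toy_model, "What just happened?", triggers[1]))  # False
print(online_score(toy_responses, triggers))
```

The split mirrors the distinction drawn in the text: the probe function supplies the timing and checks only content, while the online function also penalizes responses that arrive at the wrong moment, here the early "phone rings" at 18 s.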

Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.
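One simple way to look for the kind of degradation over time mentioned in finding (2) is to group per-trigger outcomes by when they occur in the stream. The sketch below assumes a hypothetical input of `(trigger_time_s, correct)` pairs and a fixed bucket width; the values are synthetic and do not reproduce any result from the paper.

```python
from collections import defaultdict


def accuracy_by_time_bucket(outcomes, bucket_s=60.0):
    """Group (trigger_time_s, correct) pairs into fixed-width time buckets.

    Returns a dict mapping each bucket's start time in seconds to the
    accuracy of the outcomes that fall inside it, ordered by time.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket start -> [correct, count]
    for time_s, correct in outcomes:
        bucket = int(time_s // bucket_s) * bucket_s
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {b: c / n for b, (c, n) in sorted(totals.items())}


# Synthetic outcomes, chosen only to show the shape of the output.
toy_outcomes = [
    (5.0, True), (30.0, True),      # first minute
    (70.0, True), (95.0, False),    # second minute
    (130.0, False), (170.0, False), # third minute
]

print(accuracy_by_time_bucket(toy_outcomes))
# {0.0: 1.0, 60.0: 0.5, 120.0: 0.0}
```

A downward trend across buckets would be consistent with limited long-horizon robustness, although bucket width and sample counts per bucket would need care in a real analysis.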


Get this paper in your agent:

hf papers read 2605.18577

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

