Hugging Face Daily Papers · May 29, 2026 · 6 min read

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

OmniInteract is a streaming benchmark for real-time omnimodal LLMs, evaluated through their native online inference over continuous audio-visual streams. User queries and ambient sounds live in the audio track, visual events live in the video, and models must decide whether, when, and what to respond, all without lookahead to future content.

arxiv:2605.26485

Published on May 26 · Submitted by Lu Xudong on May 29

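Authors: Xudong Lu, Xueying Li, Annan Wang, Yang Bo, Jinpeng Chen, Zengliang Li, Nianzu Yang, Rui Liu, Xue Yang, Jingwen Hou, Hongsheng Li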

AI-generated summary

OmniInteract presents a streaming benchmark for real-time omnimodal large language models that evaluates online audio-visual processing with temporal grounding and interactive response requirements.

Abstract

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
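The abstract says each slot includes a trigger, a response window, and a target answer. The sketch below shows one way such a slot could be represented and how responses might be checked against windows. Everything in it is an assumption for illustration: the class and field names (`ResponseSlot`, `ModelResponse`, `window_start`, `window_end`) are hypothetical, the released data format may differ, and `timing_f1` is an ordinary F1 over window-matched responses, not the paper's Interaction-Aware Quality-Timeliness F1, whose definition is not given here. Answer correctness is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ResponseSlot:
    """One temporally grounded slot (hypothetical field names)."""
    trigger_time: float   # seconds: when the trigger occurs in the stream
    window_start: float   # seconds: earliest acceptable response time
    window_end: float     # seconds: latest acceptable response time
    target_answer: str    # reference answer (unused by this timing-only sketch)


@dataclass
class ModelResponse:
    time: float           # seconds: when the model started responding
    text: str


def match_slot(slot: ResponseSlot, responses: List[ModelResponse],
               used: Set[int]) -> Optional[int]:
    """Return the index of the first unused response inside the window, or None."""
    for i, r in enumerate(responses):
        if i not in used and slot.window_start <= r.time <= slot.window_end:
            return i
    return None


def timing_f1(slots: List[ResponseSlot], responses: List[ModelResponse]) -> float:
    """Plain F1 over window-matched responses (NOT the paper's IA-QTF1)."""
    used: Set[int] = set()
    for slot in slots:
        idx = match_slot(slot, responses, used)
        if idx is not None:
            used.add(idx)
    tp = len(used)                      # responses that landed in some window
    if tp == 0:
        return 0.0
    precision = tp / len(responses)     # penalizes extra or mistimed outputs
    recall = tp / len(slots)            # penalizes missed slots
    return 2 * precision * recall / (precision + recall)
```

With two slots and three responses of which only one lands inside a window, precision is 1/3 and recall is 1/2, so this toy score is 0.4. Even this simplified version shows why streaming evaluation is harder than offline QA: a correct answer delivered outside its window, or an unprompted output, lowers the score.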

Community

librarian-bot

about 13 hours ago

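This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API:

* Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction (2026), https://huggingface.co/papers/2605.17360
* AURA: Always-On Understanding and Real-Time Assistance via Video Streams (2026), https://huggingface.co/papers/2604.04184
* StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering (2026), https://huggingface.co/papers/2605.25621
* MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction (2026), https://huggingface.co/papers/2604.27393
* IPIBench: Evaluating Interactive Proactive Intelligence of MLLMs under Continuous Streams (2026), https://huggingface.co/papers/2605.27074
* VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models (2026), https://huggingface.co/papers/2604.07634
* EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant (2026), https://huggingface.co/papers/2605.10343

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend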

Get this paper in your agent:

hf papers read 2605.26485

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
