Hugging Face Daily Papers · June 2, 2026 · 5 min read

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


We introduce SVI-Bench, a large-scale benchmark for Strategic Video Intelligence, using sports as a testbed of real-world, multi-agent “microworlds” to test whether models can progress from perception to causal reasoning, strategic simulation, and agentic evidence synthesis. Across nine tasks built from aligned video, play-by-play logs, commentary, reports, and statistics, our evaluation reveals a sharp capability cliff: current models handle localized perception reasonably well but struggle significantly with reasoning, simulation, and especially autonomous, cross-corpus analysis.

Project page: https://svi-bench.github.io/

GitHub: https://github.com/Texaser/SVI-Bench


arxiv:2605.31529


Published on May 29, 2026 · Submitted by Yulu Pan on Jun 2, 2026

University of North Carolina at Chapel Hill


Authors: Yulu Pan, Han Yi, Seongsu Ha, Md Mohaiminul Islam, Benjamin Zhang, Lorenzo Torresani, Gedas Bertasius

Abstract

Strategic Video Intelligence requires understanding, causal reasoning, and planning capabilities that current benchmarks fail to evaluate adequately, leading to significant performance gaps in complex cognitive tasks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

True video intelligence demands more than recognizing what is visible: it requires reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. We refer to this progression, from perception through causal reasoning and simulation to strategic planning, as Strategic Video Intelligence (SVI). No existing benchmark evaluates this capability stack: in-the-wild videos lack verifiable ground truth for causal and strategic questions, while synthetic environments sacrifice the complexity of real multi-agent systems. To bridge this gap, we introduce SVI-Bench, a large-scale benchmark that leverages team sports as a dynamic microworld, combining the complexity of real-world multi-agent interaction (10-22 agents making coordinated decisions under adversarial pressure) with the verifiability of explicit rules and definitive outcomes. SVI-Bench comprises approximately 35K hours of broadcast video, 15M annotated actions, 15K hours of expert commentary, 23K game reports, and 103K structured statistical records across basketball, soccer, and hockey, all constructed via a data engine that transforms raw game data into a dense, cross-referenced corpus. We organize evaluation into 9 tasks spanning a progressive four-pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. Evaluating strong multimodal and agentic baselines, we find a capability cliff: models perform competently on perceptual tasks, achieving approximately 73% on fine-grained action QA, but degrade sharply at each successive cognitive level. Agentic tasks prove hardest: the strongest model achieves only 5% accuracy when required to autonomously gather and integrate evidence across a corpus of 1.8M clips.
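To make the reported gap concrete, the sketch below encodes the four pillars in their stated order and computes the drop between the two accuracies quoted in the abstract. It is an illustration only, not the authors' evaluation code: the names `Pillar`, `reported_accuracy`, and `accuracy_gap` are hypothetical, filing the roughly 73% action-QA figure under the first pillar is an assumption, and the two figures may come from different models. Pillars with no figure in the abstract are left empty rather than guessed.

```python
from enum import IntEnum
from typing import Dict, Optional, Tuple


class Pillar(IntEnum):
    """The four pillars named in the abstract, in their stated order."""
    DYNAMIC_SCENE_UNDERSTANDING = 1
    CAUSAL_REASONING = 2
    STRATEGIC_SIMULATION = 3
    AGENTIC_SYNTHESIS = 4


# Only two accuracies are quoted in the abstract; the others stay None.
# Assumption: the ~73% fine-grained action QA score belongs to pillar 1.
reported_accuracy: Dict[Pillar, Optional[float]] = {
    Pillar.DYNAMIC_SCENE_UNDERSTANDING: 0.73,
    Pillar.CAUSAL_REASONING: None,
    Pillar.STRATEGIC_SIMULATION: None,
    Pillar.AGENTIC_SYNTHESIS: 0.05,
}


def accuracy_gap(
    scores: Dict[Pillar, Optional[float]], low: Pillar, high: Pillar
) -> Tuple[float, float]:
    """Return (drop in percentage points, relative drop) from `low` to `high`."""
    a, b = scores[low], scores[high]
    if a is None or b is None:
        raise ValueError("accuracy not reported for one of the pillars")
    return (a - b) * 100, (a - b) / a


points, relative = accuracy_gap(
    reported_accuracy,
    Pillar.DYNAMIC_SCENE_UNDERSTANDING,
    Pillar.AGENTIC_SYNTHESIS,
)
print(f"{points:.0f} points, {relative:.0%} relative drop")
# prints: 68 points, 93% relative drop
```

Using an ordered enum keeps the hierarchy explicit, so a comparison such as `Pillar.AGENTIC_SYNTHESIS > Pillar.CAUSAL_REASONING` reads as "later in the progression".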



Get this paper in your agent:

hf papers read 2605.31529

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

