Hugging Face Daily Papers · June 3, 2026 · 3 min read

Benchmarking Visual State Tracking in Multimodal Video Understanding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2606.03920

Published on Jun 2

Submitted by Pinzhi Huang on Jun 3

VISIONx @ NYU

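Authors: Sihyun Yu, Nanye Ma, Pinzhi Huang, Hyunseok Lee, Shusheng Yang, June Suk Choi, Ellis Brown, Oscar Michel, Boyang Zheng, Jinwoo Shin, Saining Xie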

Abstract

Current multimodal large language models struggle with visual state tracking in videos, performing poorly even when human-level capabilities are required, and existing agentic approaches do not effectively address these limitations.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment, requiring continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures, still falling short on VSTAT.
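The abstract compares models against "answer-prior baselines" without defining them here. As a rough illustration only, one common form of such a baseline ignores the video entirely and always predicts the most frequent answer. The sketch below assumes that form; the function name `answer_prior_accuracy` and the toy answer lists are hypothetical and are not taken from VSTAT or its code.

```python
from collections import Counter


def answer_prior_accuracy(reference_answers, test_answers):
    """Score a video-blind baseline that always guesses the most common answer.

    Assumption: the prior is estimated from `reference_answers` and then
    evaluated on `test_answers`. The paper's actual baseline may differ.
    """
    if not reference_answers or not test_answers:
        raise ValueError("both answer lists must be non-empty")
    # Most frequent answer in the reference set is the constant prediction.
    prior_guess, _ = Counter(reference_answers).most_common(1)[0]
    correct = sum(1 for answer in test_answers if answer == prior_guess)
    return prior_guess, correct / len(test_answers)


# Toy, made-up answers (not VSTAT data).
reference = ["2", "3", "2", "1", "2", "3"]
test = ["2", "2", "3", "1"]
guess, accuracy = answer_prior_accuracy(reference, test)
print(guess, accuracy)
```

A model that scores only modestly above this kind of constant guess is extracting little usable information from the video itself, which is the concern the abstract raises.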

Community

EdwinHuang (paper submitter) · about 11 hours ago · edited about 11 hours ago

VSTAT: https://vision-x-nyu.github.io/vstat-site/


Get this paper in your agent:

hf papers read 2606.03920

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

