Hugging Face Daily Papers · May 23, 2026 · 4 min read

VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2605.22570


Published on May 21

Submitted by Jinho Park on May 25

Authors: Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park

Abstract

AI-generated summary: VGenST-Bench is a video benchmark that uses generative models to actively synthesize controlled spatio-temporal reasoning scenarios, with a human quality control stage.

Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively curated video data, which limits the evaluation of fine-grained reasoning capabilities. In this paper, we introduce VGenST-Bench, a video benchmark that employs generative models to actively synthesize highly controlled and diverse evaluation scenarios. To construct VGenST-Bench, we propose a multi-agent pipeline incorporating a human quality control stage, ensuring the quality of all generated videos and QA pairs. We establish a comprehensive 3x2x2 video taxonomy, encompassing Spatial Scale, Perspective, and Scene Dynamics to span diverse scenarios. Furthermore, we design a hierarchical task suite that decouples low-level visual perception from high-level spatio-temporal reasoning. By shifting the paradigm from passive curation to active synthesis, VGenST-Bench enables fine-grained diagnosis of spatio-temporal understanding in MLLMs.
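As a rough illustration of the arithmetic behind the 3x2x2 taxonomy, the sketch below enumerates every combination of the three axes and shows one way results could be broken down per group for diagnosis. The abstract names the axes and their sizes but not the individual levels, so the level labels, the record schema, and the helper names `taxonomy_cells` and `accuracy_by_group` are all assumptions made for illustration, and the sample records are made up. This is not the authors' pipeline or evaluation code.

```python
from collections import defaultdict
from itertools import product

# Hypothetical placeholder labels: the abstract gives the three axes and their
# sizes (3 x 2 x 2) but does not name the individual levels on each axis.
AXES = {
    "spatial_scale": ["scale_a", "scale_b", "scale_c"],
    "perspective": ["view_a", "view_b"],
    "scene_dynamics": ["dynamics_a", "dynamics_b"],
}


def taxonomy_cells(axes=AXES):
    """Return every combination of axis levels as a tuple."""
    return list(product(*axes.values()))


def accuracy_by_group(records, key):
    """Group QA records with `key` and return the accuracy of each group.

    Each record is a dict with a boolean "correct" field (assumed schema).
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for rec in records:
        group = key(rec)
        totals[group][0] += int(rec["correct"])
        totals[group][1] += 1
    return {group: c / n for group, (c, n) in totals.items()}


cells = taxonomy_cells()
print(len(cells))  # 3 * 2 * 2 = 12

# Made-up records for illustration only; these are not results from the paper.
records = [
    {"cell": cells[0], "level": 1, "correct": True},
    {"cell": cells[0], "level": 3, "correct": False},
    {"cell": cells[1], "level": 1, "correct": True},
    {"cell": cells[1], "level": 3, "correct": True},
]

# Accuracy per hierarchy level, and per taxonomy cell.
per_level = accuracy_by_group(records, key=lambda r: r["level"])
per_cell = accuracy_by_group(records, key=lambda r: r["cell"])
print(per_level)
```

Grouping by level separates errors on lower-level questions from errors on higher-level ones, while grouping by cell shows which video conditions are hardest. Whether the benchmark reports results this way is not stated in the abstract.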


Community

navvh

2 days ago

Awesome work!

zino1

Paper author, paper submitter · about 8 hours ago

We introduce VGenST-Bench, a video benchmark for spatio-temporal reasoning in MLLMs, built via active video synthesis.

Our multi-agent pipeline produces controlled videos across a 3×2×2 taxonomy (Spatial Scale × Viewpoint × Scene Dynamics) with 12 task categories and a 3-level QA hierarchy.

Project page: https://zinosii.github.io/VGenST-Bench/
Code: https://github.com/zinosii/VGenST-Bench


