Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation
Authors: Atin Pothiraj, Jaemin Cho, Yue Zhang, Elias Stengel-Eskin, Mohit Bansal
Published: 2026-06-24 (arXiv:2606.25306)
Code / Data: https://github.com/atinpothiraj/pqsg
Abstract
Summary: Physics Question Scene Graph (PQSG) is a hierarchical question-graph framework, driven by a vision-language model, that evaluates how well generated videos adhere to physical laws, pinpoints specific violations, and is validated by correlation with human judgments.
Video generation models are increasingly capable of producing realistic videos, but they still struggle to generate videos that follow basic physical laws. Compounding this is a lack of reliable granular evaluation methods for localizing and specifying physical law violations in videos. We address this by introducing Physics Question Scene Graph (PQSG), a hierarchical question-based evaluation pipeline. PQSG evaluates generated videos by checking their faithfulness to a prompt across objects, actions, and adherence to physical laws using a graph-based hierarchy of questions generated by a vision-language model (VLM), guided by high-quality in-context examples. By representing questions as a graph, PQSG introduces logical dependencies within questions, ensuring that each query is contextually valid. Moreover, PQSG provides granular assessments of which qualities of the video violate physical plausibility constraints. We validate PQSG by creating FinePhyEval, a dataset with physics-based prompts and corresponding generated videos from diverse state-of-the-art video generation models (Sora 2, Veo 3, and Wan 2.1), with each video annotated across multiple categories by humans. Using FinePhyEval, we measure the correlation between PQSG's fine-grained scores and human judgments, showing higher overall correlations than prior work. We also find that PQSG ranks closed-source models higher than Wan 2.1 on physical realism. Lastly, we show that the annotations we provide in FinePhyEval can also be used for subtask evaluation: we benchmark two strong VLMs on generating and answering questions, finding that while models can create human-like questions, they still fall short of human performance in answering them.
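To make the idea of logical dependencies between questions concrete, here is a minimal sketch of a question graph with dependency gating and per-category scores. It is not the authors' implementation: the `Question` class, the `evaluate` and `category_scores` functions, the example prompt and questions, and the rule that a question counts as failed whenever any of its parents failed are all assumptions made for illustration. In PQSG the questions are generated by a VLM; here they are hard-coded, and the answerer is passed in as a plain function.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    qid: str
    text: str
    category: str  # assumed categories: "object", "action", "physics"
    parents: list[str] = field(default_factory=list)  # ids this question depends on


# Hypothetical question graph for the prompt
# "A ball is dropped onto a table and bounces."
QUESTIONS = [
    Question("q1", "Is there a ball?", "object"),
    Question("q2", "Is there a table?", "object"),
    Question("q3", "Does the ball fall toward the table?", "action", ["q1", "q2"]),
    Question("q4", "Does the ball speed up while falling?", "physics", ["q3"]),
    Question("q5", "Does the ball rebound lower than its drop height?", "physics", ["q3"]),
]


def evaluate(questions, answer_fn):
    """Answer each question only if all of its parents passed.

    Assumed gating rule: a question with a failed parent is not asked
    and is recorded as failed. The graph is assumed to be acyclic.
    """
    by_id = {q.qid: q for q in questions}
    results = {}

    def visit(qid):
        if qid not in results:
            q = by_id[qid]
            if all(visit(p) for p in q.parents):
                results[qid] = bool(answer_fn(q))
            else:
                results[qid] = False
        return results[qid]

    for q in questions:
        visit(q.qid)
    return results


def category_scores(questions, results):
    """Fraction of passed questions within each category."""
    totals = {}
    for q in questions:
        passed, count = totals.get(q.category, (0, 0))
        totals[q.category] = (passed + results[q.qid], count + 1)
    return {cat: passed / count for cat, (passed, count) in totals.items()}


if __name__ == "__main__":
    # Stand-in for a VLM answerer: fixed answers keyed by question id.
    stub_answers = {"q1": True, "q2": True, "q3": True, "q4": True, "q5": False}
    results = evaluate(QUESTIONS, lambda q: stub_answers[q.qid])
    print(category_scores(QUESTIONS, results))
```

With the stub answers above, the object and action questions all pass while one of the two physics questions fails, so the physics category is the one flagged. If the answerer instead reported that no ball is present, the action and physics questions that depend on it would be marked failed without being asked, which is one way a graph can keep each query contextually valid.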