Physics-IQ Verified
Authors: Tim Rädsch, Yuki M Asano, Hilde Kuehne, Stefan Bauer, Priyank Jaini, Robert Geirhos, Carsten T. Lüth
Published: 2026-06-17
arXiv: 2606.18943
Github: https://github.com/google-deepmind/physics-iq-benchmark
Abstract
A systematic evaluation of the Physics-IQ benchmark reveals limitations in measuring physical understanding of video generative models, leading to improvements in prompt quality and sample-level scoring that enhance reliability for assessing physically accurate video generation.
Video generative models (VGMs) have become a new frontier that can be used not just for video generation but for a multitude of downstream tasks, including world modeling. To advance these tasks, a good video model must understand the physical reality of the world. Evaluating this understanding is an emerging field and has led to the Physics-IQ benchmark, which quantifies it explicitly by comparing model-generated videos to real-world videos of physical experiments. In this work, we present a systematic audit of the Physics-IQ benchmark, expose its shortcomings, and propose three solutions that sharpen how we can measure the physical understanding of VGMs. Specifically, we improve prompt and ground-truth quality to reduce the influence of confounding factors, and we introduce a sample-level scoring system that weights each sample and metric equally. Our resulting benchmark, Physics-IQ Verified, refines 57.6% of all samples and improves over 34.8% of prompts. In a comparison study using six image-to-video generative models, we observe moderate but meaningful ranking changes (Kendall's τ = 0.46). We hope Physics-IQ Verified advances the community by providing a more reliable signal toward physically accurate VGMs. The code for the benchmark can be accessed at https://github.com/google-deepmind/physics-iq-benchmark.
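To make the ranking-change statistic concrete, here is a minimal sketch of how Kendall's τ compares two rankings of the same six models. The function name `kendall_tau` and both rank lists are hypothetical illustrations, not the paper's data or code, and the sketch assumes rankings without ties (the tau-a variant).

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two tie-free rankings of the same items.

    rank_a[i] and rank_b[i] are the ranks given to item i by the two
    benchmarks being compared.
    """
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must cover the same items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # A pair is concordant when both rankings order it the same way.
        sign = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ranks of six models under an original and a revised benchmark.
original_ranks = [1, 2, 3, 4, 5, 6]
revised_ranks = [2, 1, 3, 5, 4, 6]
tau = kendall_tau(original_ranks, revised_ranks)
print(f"Kendall's tau = {tau:.2f}")
```

A value of 1 means the two rankings agree on every pair of models, and -1 means they disagree on every pair. With six models there are 15 pairs, so each pair that flips order lowers τ by 2/15.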