Quantitative Video World Model Evaluation for Geometric-Consistency
Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.
Abstract
A quantitative framework called PDI-Bench is introduced for evaluating geometric coherence in generated videos through monocular reconstruction and projective-geometry residuals, revealing geometry-specific failure modes in video generators.
Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at https://pdi-bench.github.io/.
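To make the residual idea concrete, here is a minimal sketch of one of the three failure dimensions the abstract names, 3D structural rigidity. The function below is a hypothetical illustration, not the paper's actual metric: given points tracked over time and lifted to 3D world coordinates (the output of a pipeline such as SAM 2 + CoTracker3 + monocular reconstruction), a rigid object should keep all pairwise point distances constant across frames, so the temporal variation of those distances serves as a rigidity residual.

```python
import numpy as np

def rigidity_residual(points_3d: np.ndarray) -> float:
    """Hypothetical rigidity residual for tracked points lifted to 3D.

    points_3d: array of shape [T, N, 3] — N points tracked over T frames
    in world-space coordinates. A rigid object keeps all pairwise
    distances constant over time, so we report the mean temporal
    standard deviation of pairwise distances, normalised per pair.
    """
    T, N, _ = points_3d.shape
    # Pairwise difference vectors per frame: [T, N, N, 3]
    diffs = points_3d[:, :, None, :] - points_3d[:, None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)        # [T, N, N]
    iu = np.triu_indices(N, k=1)                  # unique point pairs
    pair_dists = dists[:, iu[0], iu[1]]           # [T, N*(N-1)/2]
    # Relative deviation of each pair's distance from its temporal mean
    rel_dev = pair_dists.std(axis=0) / (pair_dists.mean(axis=0) + 1e-8)
    return float(rel_dev.mean())

# A perfectly rigid object under pure translation scores ~0,
# while non-rigid "wobble" in a generated video raises the residual.
rng = np.random.default_rng(0)
obj = rng.standard_normal((5, 3))                              # 5 tracked points
traj = obj[None] + np.linspace(0, 1, 10)[:, None, None]        # translate over 10 frames
print(rigidity_residual(traj))
```

The other two dimensions (scale-depth alignment and 3D motion consistency) would be scored analogously, each as a residual against a projective-geometry constraint rather than a learned or perceptual judgment.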