Hugging Face Daily Papers · 6 min read

EvalVerse: Pipeline-Aware and Expert-Calibrated Benchmarking for Professional Cinematic Video Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.23271

Published on May 22, 2026 · Submitted by Eddie (EddieYang428) on May 27, 2026
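Authors: Songlin Yang, Haobin Zhong, Ruilin Zhang, Xiaotong Zhao, Shuai Li, Kai Zheng, Xuyi Yang, Zhe Wang, Zhenchen Tang, Yang Li, Bohai Gu, Zhengwei Peng, Yidan Huang, Mengzhou Luo, Yihang Bo, Dalu Feng, Yujia Zhang, Juntao Ma, Ruiqi Wang, Lvmin Zhang, Yuwei Guo, Frank Guan, Maneesh Agrawala, Hongbo Fu, Alan Zhao, Anyi Rao
Organization: Tencent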

Abstract

AI-generated summary: EvalVerse presents a comprehensive evaluation framework for generative video models that bridges the gap between human aesthetic judgment and machine scoring through expert-calibrated vision-language models and multi-stage cinematic assessment.

The rapid evolution of generative video foundation models has propelled the field toward professional-grade cinematic synthesis. To achieve such demanding quality, the community is transitioning toward Reinforcement Learning (RL) and agentic workflows. However, reliable evaluation has emerged as a critical bottleneck. Existing benchmarks predominantly evaluate "whether it is right" (basic prompt-following) while fundamentally neglecting "whether it is good" (cinematic quality, acting, and aesthetics). Furthermore, current automated metrics lack the domain-specific rigor required to provide trustworthy signals, creating a severe credibility gap between human aesthetic perception and machine scoring. To bridge this gap, we introduce EvalVerse, a comprehensive, pipeline-aware, and expert-calibrated evaluation framework. We treat video generation assessment not merely as an engineering task, but as a core scientific problem: the systematic digitization of subjective cinematic expertise. First, we organize domain knowledge into an evaluation taxonomy aligned with the professional filmmaking workflow (pre-production, production, and post-production). Second, we distill human expert judgments into a curated dataset with large-scale human annotations. Third, we inject this knowledge into Vision-Language Models (VLMs) through an expert-calibrated fine-tuning strategy, enabling the VLM to perform explicit Chain-of-Thought reasoning. Compared with previous work, EvalVerse not only retains compatibility with foundational "rightness" metrics but also significantly expands the criteria to "goodness" and broadens the task coverage to complex multi-shot sequencing and audio-visual integration. Consequently, by providing granular diagnostic signals, EvalVerse transcends a static leaderboard and establishes fundamental infrastructure for future work, such as reward models and evaluator agents.
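To make the idea of a pipeline-aware taxonomy with granular diagnostic signals more concrete, here is a minimal Python sketch. It is not the authors' implementation. Only the three stage names (pre-production, production, post-production) come from the abstract; the dimension names, the 1-5 score scale, the `DimensionScore` record, and both helper functions are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional

# Stage names follow the abstract. The dimensions listed under each stage
# are illustrative placeholders, NOT the paper's actual taxonomy.
TAXONOMY: Dict[str, List[str]] = {
    "pre-production": ["prompt_following", "shot_planning"],
    "production": ["acting", "cinematography"],
    "post-production": ["multi_shot_continuity", "audio_visual_sync"],
}


@dataclass(frozen=True)
class DimensionScore:
    """One evaluator judgment for a single dimension (hypothetical schema)."""
    dimension: str
    score: float      # assumed 1-5 scale
    rationale: str    # free-text reasoning produced alongside the score


def stage_report(scores: List[DimensionScore]) -> Dict[str, Optional[float]]:
    """Average the available dimension scores within each pipeline stage.

    A stage with no scored dimensions maps to None rather than 0, so that
    missing coverage is not mistaken for poor quality.
    """
    by_dimension = {s.dimension: s.score for s in scores}
    report: Dict[str, Optional[float]] = {}
    for stage, dimensions in TAXONOMY.items():
        present = [by_dimension[d] for d in dimensions if d in by_dimension]
        report[stage] = mean(present) if present else None
    return report


def weakest_dimension(scores: List[DimensionScore]) -> str:
    """Return the lowest-scoring dimension as a simple diagnostic signal."""
    return min(scores, key=lambda s: s.score).dimension


if __name__ == "__main__":
    sample = [
        DimensionScore("prompt_following", 5.0, "All requested elements appear."),
        DimensionScore("acting", 2.0, "Facial expression does not match the line."),
        DimensionScore("cinematography", 4.0, "Framing is stable and motivated."),
    ]
    print(stage_report(sample))
    print(weakest_dimension(sample))
```

Reporting per-stage averages together with the weakest dimension, rather than a single scalar, is one simple way to surface the kind of diagnostic signal that a reward model or evaluator agent could consume.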


Get this paper in your agent:

hf papers read 2605.23271
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
