Hugging Face Daily Papers · 3 min read

Video Models Can Reason with Verifiable Rewards

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://darthzhu.github.io/VideoRLVR-page/
Code: https://github.com/luka-group/VideoRLVR
arXiv: 2605.15458

Published on May 14 · Submitted by Tinghui Zhu on May 20
Authors: Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen

Abstract

AI-generated summary: VideoRLVR optimizes video diffusion models for verifiable reasoning tasks using reinforcement learning with rule-based rewards, achieving better performance than supervised methods in constraint-satisfying video generation.

Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
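
The three ingredients named in the abstract can be illustrated with a small Python sketch. It is not taken from the paper or its repository: the maze encoding, the reward terms and their weights, the function names, and the 50% early-step fraction are all assumptions chosen for illustration. It shows a dense decomposed rule-based reward for a decoded maze trajectory, the group-relative advantage normalization commonly used in GRPO-style training, and a mask that limits policy updates to early denoising steps.

```python
from statistics import mean, pstdev


def manhattan(a, b):
    """Manhattan distance between two (row, col) cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dense_maze_reward(path, walls, start, goal):
    """Score a decoded trajectory with decomposed rule-based terms.

    Hypothetical decomposition (weights are illustrative assumptions):
    start check, step validity, progress toward the goal, full success.
    """
    if not path:
        return 0.0
    # Term 1: the trajectory begins at the start cell.
    starts_ok = 1.0 if path[0] == start else 0.0
    # Term 2: fraction of moves that go one cell and do not enter a wall.
    legal = 0
    for prev, cur in zip(path, path[1:]):
        if manhattan(prev, cur) == 1 and cur not in walls:
            legal += 1
    validity = legal / max(len(path) - 1, 1)
    # Term 3: how much of the start-to-goal distance has been closed.
    total = manhattan(start, goal)
    if total == 0:
        progress = 1.0
    else:
        progress = 1.0 - min(manhattan(path[-1], goal), total) / total
    # Term 4: sparse success, only when every rule holds.
    solved = starts_ok == 1.0 and validity == 1.0 and path[-1] == goal
    success = 1.0 if solved else 0.0
    return 0.1 * starts_ok + 0.3 * validity + 0.2 * progress + 0.4 * success


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps, fraction=0.5):
    """Mark which denoising steps receive policy-gradient updates.

    Only the first `fraction` of steps are trained; the cutoff value is
    an assumption, not a number reported in the abstract.
    """
    cutoff = max(1, round(num_steps * fraction))
    return [step < cutoff for step in range(num_steps)]


if __name__ == "__main__":
    walls = {(0, 1)}
    start, goal = (0, 0), (0, 2)
    group = [
        [(0, 0), (1, 0), (1, 1), (1, 2), (0, 2)],  # valid detour around the wall
        [(0, 0), (0, 1), (0, 2)],                  # reaches the goal through a wall
        [(0, 0), (1, 0)],                          # legal but makes no progress
    ]
    rewards = [dense_maze_reward(p, walls, start, goal) for p in group]
    print(rewards)
    print(group_relative_advantages(rewards))
    print(early_step_mask(10, 0.5))
```

In this sketch the trajectory that cuts through a wall still earns partial credit for starting correctly and reaching the goal cell, which is the kind of graded signal a dense reward provides when full successes are rare within a sampled group.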

Community

Paper submitter about 8 hours ago

VideoRLVR: a systematic RL recipe that turns video models into visual reasoners using verifiable rewards.

Get this paper in your agent:

hf papers read 2605.15458
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

