Video Models Can Reason with Verifiable Rewards
Authors: Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen
Organization: Language Understanding and Knowledge Acquisition Lab
Published: 2026-05-14 (arXiv: 2605.15458)
Project page: https://darthzhu.github.io/VideoRLVR-page/
Code: https://github.com/luka-group/VideoRLVR
Abstract
AI-generated summary: VideoRLVR optimizes video diffusion models for verifiable reasoning tasks using reinforcement learning with rule-based rewards, achieving better performance than supervised methods in constraint-satisfying video generation.
Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
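The abstract names three ingredients (rule-based dense decomposed rewards, a GRPO-style optimization backbone, and Early-Step Focus) without defining them on this page. The sketch below is only an illustration of how such pieces could fit together for a Maze-like task, not the authors' implementation. Everything in it is an assumption: the reward components and their weights, the idea that a generated video has already been decoded into a list of grid cells, the group-normalized advantage as a stand-in for the GRPO update signal, the 0.5 early-step fraction, and all function names.

```python
from statistics import mean, pstdev

Cell = tuple[int, int]

# Hypothetical weights for the decomposed reward; not taken from the paper.
DEFAULT_WEIGHTS = {"start": 0.1, "valid": 0.2, "progress": 0.2, "success": 0.5}


def maze_reward(path: list[Cell], walls: set[Cell], start: Cell, goal: Cell) -> dict[str, float]:
    """Score a decoded maze trajectory with rule-based, decomposed terms."""
    if not path:
        return {name: 0.0 for name in DEFAULT_WEIGHTS}
    n_steps = len(path) - 1
    # A step is legal if it moves exactly one cell (4-connected) and does not enter a wall.
    legal = sum(
        1
        for a, b in zip(path, path[1:])
        if abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1 and b not in walls
    )

    def dist(cell: Cell) -> int:
        # Manhattan distance from a cell to the goal.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    d0 = dist(start)
    starts_ok = path[0] == start
    solved = starts_ok and n_steps > 0 and legal == n_steps and path[-1] == goal
    return {
        "start": float(starts_ok),
        "valid": legal / n_steps if n_steps else 0.0,
        # Fraction of the initial distance to the goal that the path closed.
        "progress": max(0.0, (d0 - dist(path[-1])) / d0) if d0 else 1.0,
        "success": float(solved),
    }


def total_reward(parts: dict[str, float], weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Combine the decomposed terms into one scalar reward."""
    return sum(weights[name] * value for name, value in parts.items())


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps: int, fraction: float = 0.5) -> list[bool]:
    """Mark which denoising steps would receive policy updates (earliest first)."""
    k = max(1, int(num_steps * fraction))
    return [i < k for i in range(num_steps)]
```

In this sketch a failed rollout still earns partial credit through the `valid` and `progress` terms, which is one way a dense reward could give a learning signal when full successes are rare. The mask limits updates to the first portion of the denoising schedule, so the later steps are sampled but not optimized.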
Community
VideoRLVR: a systematic RL recipe that turns video models into visual reasoners using verifiable rewards.