Hugging Face Daily Papers · June 2, 2026 · 4 min read

VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Video Generation Models can produce temporally coherent visual trajectories, yet often fail to follow task-specific rules. We introduce a VLM-as-Teacher framework that synthesizes task-specific reward queries and guides a VGM Reasoner through online test-time optimization of a lightweight LoRA module.

arxiv:2606.02564

Published on Jun 1 · Submitted by CJH on Jun 2 · Kling Team

Authors: Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao, Pengfei Wan, Kun Gai, Jing Liao

Abstract

AI-generated summary: Video generation models combined with vision-language models acting as test-time teachers through differentiable rewards achieve superior video reasoning performance.

The recent "Reasoning with Video" paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to logical failures across diverse reasoning scenarios. Existing efforts try to utilize Vision-Language Models (VLMs) as problem pre-solvers to produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often struggle to faithfully execute fine-grained or long-tail instructions even with a valid plan. While VLMs struggle as solvers, they possess strong perception capabilities to evaluate process-constraint satisfaction and final-goal achievement. Leveraging this strength, we introduce a paradigm shift that transitions the role of VLMs to "teachers". Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, guiding a VGM Reasoner via test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends the reasoning capabilities beyond the VGM's intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings reveal that integrating VLMs as test-time teachers offers a promising paradigm for achieving generalizable video reasoning. Project Page: https://VLM-as-Teacher.github.io/
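
The test-time loop described in the abstract (a frozen generator, a small low-rank adapter, and a differentiable reward that drives gradient updates on the adapter alone) can be illustrated with a toy example. This is a minimal sketch under strong assumptions, not the authors' implementation: the VGM is replaced by a frozen linear map `W`, the VLM teacher's reward is replaced by a hypothetical `toy_reward` (negative squared distance to a rule-satisfying target), and the names `optimize_lora_at_test_time`, `toy_reward`, and `matvec` are invented for illustration. The abstract does not specify the form of the actual reward or the optimizer.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def toy_reward(y, target):
    """Hypothetical differentiable reward standing in for the VLM teacher:
    0 when the output matches the target, negative otherwise."""
    return -sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def optimize_lora_at_test_time(W, x, target, rank=1, steps=100, lr=0.05):
    """Gradient ascent on LoRA factors A and B for one test input.

    The adapted output is y = W x + B (A x). W is frozen; only A and B move.
    """
    d_out, d_in = len(W), len(x)
    # B starts at zero so the adapter is initially a no-op. A uses a small
    # constant (assumption: chosen instead of random init for reproducibility).
    A = [[0.1] * d_in for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    base = matvec(W, x)  # frozen model output, computed once
    history = []
    for _ in range(steps):
        h = matvec(A, x)  # low-rank bottleneck activation
        y = [b + d for b, d in zip(base, matvec(B, h))]
        history.append(toy_reward(y, target))
        # Gradient of the reward with respect to the output y.
        g = [-2.0 * (yi - ti) for yi, ti in zip(y, target)]
        # Backpropagate through B to the bottleneck (uses B before its update).
        grad_h = [sum(g[i] * B[i][k] for i in range(d_out)) for k in range(rank)]
        # Ascent step on B: dR/dB[i][k] = g[i] * h[k].
        for i in range(d_out):
            for k in range(rank):
                B[i][k] += lr * g[i] * h[k]
        # Ascent step on A: dR/dA[k][j] = grad_h[k] * x[j].
        for k in range(rank):
            for j in range(d_in):
                A[k][j] += lr * grad_h[k] * x[j]
    return A, B, history


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]  # frozen "generator" weights
    _, _, history = optimize_lora_at_test_time(W, x=[1.0, 2.0], target=[2.0, 1.0])
    print(f"reward before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

In this sketch only `A` and `B` are ever written, so the base weights stay intact and the adapter can be discarded or reset for the next task instance. A real system would differ in every component: the reward would come from a VLM evaluating generated video, and gradients would flow through the video generator rather than a linear map.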