Hugging Face Daily Papers · 6 min read

CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.19995

Published on May 19, 2026 · Submitted by hongjiyang on May 20, 2026
Authors: Hongji Yang, Songlian Li, Yucheng Zhou, Xiaotong Zhao, Alan Zhao, Chengzhong Xu, Jianbing Shen

Abstract

CogOmniControl factorizes controllable video generation into creative intent cognition, handled by a specialized CogVLM, and generation, handled by CogOmniDiT, with reinforcement-learning alignment and Best-of-N selection closing the loop.

AI-generated summary

Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse, or complex conditions, leading to poor performance in professional production workflows that rely on conditions such as storyboard sketches and clay renders. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) with a diffusion backbone, leaving a capability gap and failing to produce videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clearer outputs, accurately cognizing the user's creative intent from sparse and abstract conditions and turning these cues into dense reasoning output. In addition, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we unlock its potential for planning specific evaluators and enable Best-of-N selection over the generated videos. This integration transforms the entire framework into a closed-loop "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflow data that carries genuine creative intent rather than simulated intent. Experiments on the two benchmarks show that CogOmniControl surpasses existing open-source models. The project website: https://um-lab.github.io/CogOmniControl/
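The closed loop described in the abstract (reason about intent, generate candidates, plan evaluators, pick the best of N) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`best_of_n`, `reason`, `generate`, `plan_evaluators`, the toy stubs) is hypothetical, candidates are stand-in strings rather than videos, and averaging the evaluator scores is an assumed aggregation rule that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: an evaluator maps a candidate (a string stand-in for a video) to a score.
Evaluator = Callable[[str], float]


@dataclass
class Selection:
    best: str                         # highest-scoring candidate
    best_score: float                 # its mean evaluator score
    scores: List[Tuple[str, float]]   # every candidate with its score


def best_of_n(
    reason: Callable[[str], str],
    generate: Callable[[str, int], str],
    plan_evaluators: Callable[[str], Sequence[Evaluator]],
    condition: str,
    n: int = 4,
) -> Selection:
    """Generate n candidates and keep the one with the highest mean evaluator score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    intent = reason(condition)            # sparse condition -> dense reasoning text
    evaluators = plan_evaluators(intent)  # evaluators planned from the reasoning
    if not evaluators:
        raise ValueError("at least one evaluator is required")
    scores: List[Tuple[str, float]] = []
    for seed in range(n):
        candidate = generate(intent, seed)  # one candidate per seed
        mean = sum(e(candidate) for e in evaluators) / len(evaluators)  # assumed aggregation
        scores.append((candidate, mean))
    best, best_score = max(scores, key=lambda pair: pair[1])
    return Selection(best, best_score, scores)


# Toy stubs standing in for the reasoning model, the generator, and the evaluator planner.
def toy_reason(condition: str) -> str:
    return f"intent({condition})"


def toy_generate(intent: str, seed: int) -> str:
    return f"{intent}#seed{seed}"


def toy_plan(intent: str) -> Sequence[Evaluator]:
    # First evaluator scores by the trailing seed digit; second is a constant.
    return [lambda video: float(video[-1]), lambda video: 1.0]


result = best_of_n(toy_reason, toy_generate, toy_plan, "storyboard sketch", n=4)
print(result.best, result.best_score)
```

Passing the three stages in as callables keeps the selection loop independent of any particular model, so the same loop would work if the stubs were swapped for real reasoning, generation, and scoring components.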
