SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control
Authors: Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li
Published: 2026-05-27
Project page: https://orange-3dv-team.github.io/SmartDirector/
Code: https://github.com/Orange-3DV-Team/SmartDirector
Abstract
AI-generated summary: SmartDirector enhances video generation by using multiple keyframes to improve narrative structure and temporal pacing through a two-stage process of low-resolution generation and high-resolution refinement.
The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.
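The two-stage data flow described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the names `Keyframe`, `director_gen_stub`, and `director_sr_stub` are hypothetical, frames are flat lists of numbers rather than images, stage 1 is replaced by plain linear interpolation, and stage 2 by pixel repetition plus substitution of the high-resolution keyframes at their own positions. It is not the authors' model or API; it only shows how keyframes placed at chosen frame indices shape pacing, and how high-resolution keyframes can act as anchors during refinement.

```python
from dataclasses import dataclass
from typing import List

Frame = List[float]  # toy 1-D "image": a flat list of pixel intensities


@dataclass
class Keyframe:
    index: int        # frame position where this keyframe should appear
    low_res: Frame    # low-resolution version, consumed by stage 1
    high_res: Frame   # high-resolution version, consumed by stage 2


def director_gen_stub(keyframes: List[Keyframe], num_frames: int) -> List[Frame]:
    """Stage 1 stand-in (hypothetical): interpolate between neighboring keyframes."""
    kfs = sorted(keyframes, key=lambda k: k.index)
    video = []
    for t in range(num_frames):
        # Nearest keyframe at or before t, and at or after t (clamped at the ends).
        prev = max((k for k in kfs if k.index <= t), key=lambda k: k.index, default=kfs[0])
        nxt = min((k for k in kfs if k.index >= t), key=lambda k: k.index, default=kfs[-1])
        span = nxt.index - prev.index
        w = 0.0 if span == 0 else (t - prev.index) / span
        video.append([(1 - w) * a + w * b for a, b in zip(prev.low_res, nxt.low_res)])
    return video


def director_sr_stub(low_video: List[Frame], keyframes: List[Keyframe], scale: int) -> List[Frame]:
    """Stage 2 stand-in (hypothetical): upsample, reusing high-res keyframes as anchors."""
    anchors = {k.index: k.high_res for k in keyframes}
    high_video = []
    for t, frame in enumerate(low_video):
        if t in anchors:
            # At a keyframe position, the high-resolution keyframe is used directly.
            high_video.append(list(anchors[t]))
        else:
            # Elsewhere, repeat each pixel `scale` times (nearest-neighbor upsampling).
            high_video.append([p for p in frame for _ in range(scale)])
    return high_video


# Placing the second keyframe at index 1 makes the change happen quickly,
# then the video relaxes slowly toward the last keyframe at index 4.
keyframes = [
    Keyframe(0, [0.0, 0.0], [0.0, 0.0, 0.0, 0.0]),
    Keyframe(1, [1.0, 1.0], [1.0, 0.9, 1.0, 0.9]),
    Keyframe(4, [0.0, 0.0], [0.0, 0.1, 0.0, 0.1]),
]
low = director_gen_stub(keyframes, num_frames=5)
high = director_sr_stub(low, keyframes, scale=2)
```

Moving a keyframe's `index` earlier or later changes how many frames are spent on each transition, which is the sense in which multiple keyframes give finer pacing control than a first/last-frame pair. In the actual framework both stages are learned generative models, so the interpolation and pixel repetition above should be read purely as placeholders.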
Community
Jun Liang (utopiar): SmartDirector introduces a two-stage framework that empowers cinematic video generation with precise narrative pacing and high-fidelity detail recovery by leveraging a novel Multi-Chunk VAE strategy to circumvent temporal causal constraints.
MeiYi (natalie5): It would be awesome to have the model weights to experiment with and improve upon. 🙏🙏🙏
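This is an automated message from the Librarian Bot (https://huggingface.co/librarian-bots). I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
* DreamShot: Personalized Storyboard Synthesis with Video Diffusion Prior (2026), https://huggingface.co/papers/2604.17195
* DrawVideo: Generating Long Video from Storyboard Keyframe Sketches (2026), https://huggingface.co/papers/2605.23508
* Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration (2026), https://huggingface.co/papers/2605.17423
* Rethinking Position Embedding as a Context Controller for Multi-Reference and Multi-Shot Video Generation (2026), https://huggingface.co/papers/2604.03738
* Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation (2026), https://huggingface.co/papers/2604.09195
* CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives (2026), https://huggingface.co/papers/2605.12496
* MuSS: A Large-Scale Dataset and Cinematic Narrative Benchmark for Multi-Shot Subject-to-Video Generation (2026), https://huggingface.co/papers/2604.23789
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend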
Cite arxiv.org/abs/2605.27891 in a model, dataset, or Space README.md to link it from this page.