Hugging Face Daily Papers · May 29, 2026 · 6 min read

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

SmartDirector introduces a two-stage framework that empowers cinematic video generation with precise narrative pacing and high-fidelity detail recovery by leveraging a novel Multi-Chunk VAE strategy to circumvent temporal causal constraints.

Project page: https://orange-3dv-team.github.io/SmartDirector/
Code: https://github.com/Orange-3DV-Team/SmartDirector

arxiv:2605.27891

Published on May 27 · Submitted by Jun Liang on May 29 · Orange Team

Authors: Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li

Abstract

AI-generated summary: SmartDirector enhances video generation by using multiple keyframes to improve narrative structure and temporal pacing through a two-stage process of low-resolution generation and high-resolution refinement.

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.
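
The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the learned Director-Gen and Director-SR models are replaced here by simple stand-ins (linear interpolation between downsampled keyframes, then nearest-neighbour upsampling with the high-resolution keyframes re-inserted at their positions). All names (`Keyframe`, `director_gen_stub`, `director_sr_stub`, `generate`) are hypothetical and chosen only for illustration. The sketch shows one idea only: where each keyframe is placed on the timeline determines the pacing of the transitions between them.

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[float]]  # H x W grayscale frame; a toy stand-in for an image


@dataclass
class Keyframe:
    index: int    # target position in the output video (this sets the pacing)
    image: Frame  # high-resolution keyframe


def downsample(frame: Frame, factor: int) -> Frame:
    """Average-pool by an integer factor (H and W must be divisible by it)."""
    h, w = len(frame), len(frame[0])
    area = factor * factor
    return [
        [
            sum(frame[y + dy][x + dx]
                for dy in range(factor) for dx in range(factor)) / area
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def upsample(frame: Frame, factor: int) -> Frame:
    """Nearest-neighbour upsampling by an integer factor."""
    return [
        [value for value in row for _ in range(factor)]
        for row in frame
        for _ in range(factor)
    ]


def director_gen_stub(keyframes: List[Keyframe], num_frames: int,
                      factor: int) -> List[Frame]:
    """Stage 1 placeholder: a low-resolution video conditioned on keyframes.

    ASSUMPTION: linear interpolation stands in for the generative model.
    """
    kfs = sorted(keyframes, key=lambda k: k.index)
    low = [downsample(k.image, factor) for k in kfs]
    video: List[Frame] = []
    for t in range(num_frames):
        if t <= kfs[0].index:        # before the first keyframe: hold it
            video.append([row[:] for row in low[0]])
        elif t >= kfs[-1].index:     # after the last keyframe: hold it
            video.append([row[:] for row in low[-1]])
        else:
            for i in range(len(kfs) - 1):
                a, b = kfs[i], kfs[i + 1]
                if a.index <= t < b.index:
                    w = (t - a.index) / (b.index - a.index)
                    video.append([
                        [(1 - w) * pa + w * pb for pa, pb in zip(ra, rb)]
                        for ra, rb in zip(low[i], low[i + 1])
                    ])
                    break
    return video


def director_sr_stub(low_video: List[Frame], keyframes: List[Keyframe],
                     factor: int) -> List[Frame]:
    """Stage 2 placeholder: upsample, then anchor on high-res keyframes.

    ASSUMPTION: the real refinement model is replaced by nearest-neighbour
    upsampling plus copying each keyframe back at its own position.
    """
    video = [upsample(frame, factor) for frame in low_video]
    for k in keyframes:
        if 0 <= k.index < len(video):
            video[k.index] = [row[:] for row in k.image]
    return video


def generate(keyframes: List[Keyframe], num_frames: int,
             factor: int = 2) -> List[Frame]:
    """Run both placeholder stages in sequence."""
    low_video = director_gen_stub(keyframes, num_frames, factor)
    return director_sr_stub(low_video, keyframes, factor)
```

As a usage example, placing three keyframes at positions 0, 2, and 8 of a 9-frame clip gives a fast transition into the second keyframe (two frames) and a slow transition out of it (six frames). Moving the middle keyframe to position 6 would reverse that pacing without changing any image content.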