Project page: https://shariqfarooq123.github.io/LooseControlVideo/
arXiv: 2606.19495 (published 2026-06-17)
Authors: Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli
Organization: Adobe
LooseControlVideo: Directorial Video Control using Spatial Blocking
Abstract
LooseControlVideo enables intuitive 3D spatial control in text-to-video generation using sparse oriented 3D boxes as proxies, achieving superior trajectory accuracy and occlusion handling compared to existing methods.
Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they require dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics, and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation, and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.
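To make the idea of an oriented 3D box used as a "blocking" proxy more concrete, here is a minimal sketch of how such a box could be represented, projected to the image, and ordered by depth. This is not the DNOCS encoding, whose details are not given in the abstract. The `OrientedBox` class, the yaw-only rotation, the pinhole camera parameters, and all function names are hypothetical choices made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class OrientedBox:
    # Hypothetical container; the paper's actual box format is not specified here.
    center: tuple  # (x, y, z) in camera coordinates, z pointing forward
    size: tuple    # (width, height, length) along the box's local x, y, z axes
    yaw: float     # rotation about the vertical (y) axis, in radians


def box_corners(box):
    """Return the 8 corners of the box in camera coordinates."""
    cx, cy, cz = box.center
    w, h, l = box.size
    c, s = math.cos(box.yaw), math.sin(box.yaw)
    corners = []
    for dx in (-w / 2, w / 2):
        for dy in (-h / 2, h / 2):
            for dz in (-l / 2, l / 2):
                # Rotate the local offset about the y axis, then translate.
                x = c * dx + s * dz
                z = -s * dx + c * dz
                corners.append((cx + x, cy + dy, cz + z))
    return corners


def project(point, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a camera-space point (assumes z > 0)."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def depth_order(boxes):
    """Indices of boxes sorted nearest-first by center depth."""
    return sorted(range(len(boxes)), key=lambda i: boxes[i].center[2])


if __name__ == "__main__":
    far = OrientedBox(center=(0.0, 0.0, 5.0), size=(2.0, 2.0, 2.0), yaw=0.0)
    near = OrientedBox(center=(1.0, 0.0, 3.0), size=(1.0, 1.0, 1.0), yaw=0.5)
    for i in depth_order([far, near]):
        box = [far, near][i]
        print(i, [project(p) for p in box_corners(box)][:2])
```

Sorting by center depth is only a crude stand-in for occlusion reasoning. It is included to show why a 3D proxy carries ordering information that a 2D box does not.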