Hugging Face Daily Papers · 4 min read

LooseControlVideo: Directorial Video Control using Spatial Blocking

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://shariqfarooq123.github.io/LooseControlVideo/
arxiv:2606.19495

Published on Jun 17 · Submitted by Shariq Farooq Bhat on Jun 19
Authors: Shariq Farooq Bhat, Niloy J. Mitra, Kalyan Sunkavalli

Abstract

LooseControlVideo enables intuitive 3D spatial control in text-to-video generation using sparse oriented 3D boxes as proxies, achieving superior trajectory accuracy and occlusion handling compared to existing methods.

Precise 3D spatial orchestration in text-to-video generation remains a significant challenge, particularly for multi-object scenes where semantic layout and temporal dynamics are often entangled. While existing depth-conditioned models achieve good structural fidelity, they necessitate dense, frame-accurate guidance that is labor-intensive to author for dynamic events involving deformable objects. We present LooseControlVideo, a framework that enables intuitive and expressive control by using sparse, oriented 3D boxes as a "blocking" proxy. This allows users to author high-level layout and trajectory while leveraging a video generative model to generate realistic occlusions, dynamics, and interactions. We achieve this by fine-tuning a Wan 2.2 backbone on a video dataset annotated with DNOCS, a novel encoding for 3D size, orientation, and depth-ordered occlusions. Furthermore, our method allows for localized refinement, such as adjusting a jump trajectory or adding an interaction, with minimal disruption to the global scene context. Extensive evaluations on the nuScenes, HO-3D, and BEHAVE benchmarks demonstrate that LooseControlVideo significantly outperforms existing 2D-box and flow-based baselines. Our findings indicate a 1.2x to 3x improvement in Trajectory Error, a 2x improvement in Rigid Motion Consistency, and a 1.5x to 2x increase in Occlusion Accuracy over current state-of-the-art layout-conditioned models, demonstrating that oriented 3D primitives provide a good geometric prior for complex, multi-agent video authoring.
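
To make the idea of sparse "blocking" concrete, here is a minimal Python sketch of how a few keyframed, oriented 3D boxes could be expanded into per-frame layout. It is an illustration under stated assumptions, not the paper's method: the `OrientedBox` type, the linear interpolation between keyframes, and the nearest-first `depth_order` helper are all hypothetical, and the sketch does not implement DNOCS, whose encoding the abstract does not specify.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class OrientedBox:
    """Hypothetical blocking primitive in camera space (not from the paper)."""
    name: str
    center: tuple[float, float, float]  # (x, y, z); z is depth along the camera axis
    size: tuple[float, float, float]    # (width, height, length)
    yaw: float                          # rotation about the vertical axis, in radians


def lerp(a, b, t):
    """Linear interpolation between two scalars."""
    return a + (b - a) * t


def lerp_angle(a, b, t):
    """Interpolate along the shortest arc between two angles."""
    delta = (b - a + math.pi) % (2 * math.pi) - math.pi
    return a + delta * t


def interpolate(k0, k1, t):
    """Blend two keyframes of the same box, with t in [0, 1]."""
    center = tuple(lerp(p, q, t) for p, q in zip(k0.center, k1.center))
    size = tuple(lerp(p, q, t) for p, q in zip(k0.size, k1.size))
    return OrientedBox(k0.name, center, size, lerp_angle(k0.yaw, k1.yaw, t))


def sample_track(keyframes, frame):
    """Return the box at `frame` from (frame_index, box) pairs.

    Assumes the pairs are sorted by frame index with no duplicates.
    Frames outside the keyframed range hold the first or last box.
    """
    if frame <= keyframes[0][0]:
        return keyframes[0][1]
    for (f0, b0), (f1, b1) in zip(keyframes, keyframes[1:]):
        if f0 <= frame <= f1:
            return interpolate(b0, b1, (frame - f0) / (f1 - f0))
    return keyframes[-1][1]


def depth_order(boxes):
    """Box names sorted nearest-first by center depth (a crude occlusion order)."""
    return [b.name for b in sorted(boxes, key=lambda b: b.center[2])]


# Two keyframes for a car that approaches and turns; one static person.
car = [
    (0, OrientedBox("car", (-2.0, 0.0, 8.0), (1.8, 1.5, 4.2), 0.0)),
    (48, OrientedBox("car", (2.0, 0.0, 3.0), (1.8, 1.5, 4.2), math.pi / 2)),
]
person = [(0, OrientedBox("person", (0.0, 0.0, 6.0), (0.6, 1.8, 0.4), 0.0))]

for frame in (0, 24, 48):
    boxes = [sample_track(car, frame), sample_track(person, frame)]
    print(frame, depth_order(boxes))
```

The point of the sketch is the authoring cost: three boxes describe 49 frames of layout, and the change in depth order as the car passes the person falls out of the box positions rather than being drawn by hand. How a generative model would consume such boxes is outside what this example shows.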


Get this paper in your agent:

hf papers read 2606.19495
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
