Hugging Face Daily Papers · May 26, 2026 · 4 min read

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


MotiMotion addresses an important limitation of current motion-controlled video generation systems: user-provided trajectories are often sparse and incomplete, yet existing methods follow them too rigidly, leading to unrealistic motion and missing causal effects. This paper introduces a reasoning-then-generation framework that uses vision-language models to interpret user intent, refine trajectories, and infer physically plausible secondary interactions before video synthesis. The proposed benchmark further highlights the need for commonsense and physics-aware evaluation in video generation. Overall, the paper presents a practical and well-motivated step toward more intelligent, controllable, and realistic video generation systems.


arxiv:2605.22818


Published on May 21, 2026 · Submitted by Hsin-Ying on May 26, 2026 · Adobe Research


Authors: Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu

Abstract

AI-generated summary: MotiMotion introduces a reasoning-then-generation framework for motion-controlled video generation that improves plausibility through vision-language reasoning and confidence-aware control mechanisms.

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.
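The abstract does not spell out how confidence modulates guidance strength, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical per-trajectory confidence score in [0, 1], a linear mapping from that score to a guidance weight, and a simple blend in image space between the planned path and a stand-in for the model's own motion estimate. All names (`PlannedTrajectory`, `guidance_strength`, `blend_motion`) and the default bounds are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image-space coordinates


@dataclass
class PlannedTrajectory:
    """A planned motion path plus a confidence score (hypothetical structure)."""
    points: List[Point]
    confidence: float  # assumed to lie in [0, 1]; out-of-range values are clamped


def guidance_strength(confidence: float,
                      min_scale: float = 0.2,
                      max_scale: float = 1.0) -> float:
    """Map confidence to a guidance weight by linear interpolation.

    The linear rule and the default bounds are assumptions for illustration.
    """
    c = min(1.0, max(0.0, confidence))  # clamp to [0, 1]
    return (1.0 - c) * min_scale + c * max_scale


def blend_motion(prior: List[Point], plan: PlannedTrajectory) -> List[Point]:
    """Blend a prior motion estimate with the planned path.

    High confidence pulls the result toward the plan; low confidence
    leaves more room for the prior (a stand-in for generative priors).
    """
    if len(prior) != len(plan.points):
        raise ValueError("prior and plan must have the same number of points")
    w = guidance_strength(plan.confidence)
    return [
        ((1.0 - w) * px + w * qx, (1.0 - w) * py + w * qy)
        for (px, py), (qx, qy) in zip(prior, plan.points)
    ]


if __name__ == "__main__":
    prior = [(0.0, 0.0), (10.0, 0.0)]
    plan = PlannedTrajectory(points=[(0.0, 0.0), (10.0, 10.0)], confidence=0.5)
    print(blend_motion(prior, plan))
```

In a real diffusion-based generator the weight would more plausibly scale a conditioning or guidance term inside the sampler rather than blend coordinates directly; the coordinate blend here only keeps the sketch self-contained.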

Project page: https://motimotion.github.io/ · GitHub: https://github.com/motimotion/motimotion

Community

shinying

Paper author Paper submitter about 3 hours ago

