Hugging Face Daily Papers · 4 min read

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Lee Hsin-Ying, Hanwen Jiang, Yiqun Mei, Jing Shi, Ming-Hsuan Yang, Zhixin Shu
Organization: Adobe Research
Project page: https://motimotion.github.io/
Code: https://github.com/motimotion/motimotion

arxiv:2605.22818

Published on May 21, 2026 · Submitted by Hsin-Ying on May 26, 2026

AI-generated summary

MotiMotion introduces a reasoning-then-generation framework for motion-controlled video generation that improves plausibility through vision-language reasoning and confidence-aware control mechanisms.

Abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.
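The abstract describes the confidence-aware control scheme only at a high level, so the following is a minimal sketch of one way such a scheme could look, not the paper's implementation. It assumes each planned trajectory carries a confidence score in [0, 1], maps that score linearly to a guidance weight, and uses the weight to interpolate between a prior-only prediction and a trajectory-guided prediction. All names here (`PlannedTrajectory`, `guidance_weight`, `blend_prediction`, `w_min`, `w_max`) and the linear rule itself are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PlannedTrajectory:
    """One object's motion plan (hypothetical structure, not from the paper)."""
    points: List[Tuple[float, float]]  # image-space (x, y) waypoints
    confidence: float                  # assumed planner confidence in [0, 1]


def guidance_weight(confidence: float, w_min: float = 0.0, w_max: float = 1.0) -> float:
    """Map confidence to a guidance strength by linear interpolation (assumed rule)."""
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range scores to [0, 1]
    return w_min + c * (w_max - w_min)


def blend_prediction(prior_pred: Sequence[float],
                     guided_pred: Sequence[float],
                     weight: float) -> List[float]:
    """Interpolate element-wise: weight 0 keeps the prior, weight 1 follows the plan."""
    return [p + weight * (g - p) for p, g in zip(prior_pred, guided_pred)]


# A low-confidence plan pulls the output only a quarter of the way toward the guided prediction.
plan = PlannedTrajectory(points=[(10.0, 20.0), (40.0, 25.0)], confidence=0.25)
w = guidance_weight(plan.confidence)
mixed = blend_prediction([0.0, 2.0], [4.0, 6.0], w)
print(mixed)  # [1.0, 3.0]
```

In this sketch a high-confidence plan is followed closely, while a low-confidence plan leaves more room for the model's own generative prior, which matches the behavior the abstract describes. The actual modulation rule used by MotiMotion may differ.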

Community

Comment by Hsin-Ying (paper author and submitter), May 26, 2026:

MotiMotion addresses an important limitation of current motion-controlled video generation systems: user-provided trajectories are often sparse and incomplete, yet existing methods follow them too rigidly, leading to unrealistic motion and missing causal effects. This paper introduces a reasoning-then-generation framework that uses vision-language models to interpret user intent, refine trajectories, and infer physically plausible secondary interactions before video synthesis. The proposed benchmark further highlights the need for commonsense and physics-aware evaluation in video generation. Overall, the paper presents a practical and well-motivated step toward more intelligent, controllable, and realistic video generation systems.

Datasets citing this paper 1
