Hugging Face Daily Papers · 6 min read

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.18558

Published on Jun 17, 2026 · Submitted by Jianing Zhang on Jun 18, 2026

Authors: Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

Organization: Ai2 (allenai)
Project page: https://allenai.org/blog/molmo-motion
Code: https://github.com/allenai/molmo-motion

Abstract

MolmoMotion is a 3D point motion forecasting model that predicts object trajectories from a short visual history and a language goal. It outperforms existing motion prediction baselines on PointMotionBench and transfers to robot manipulation and video generation.

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.
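
To make the task interface in the abstract concrete, here is a minimal sketch of the inputs (visual history, 3D query points, language goal) and the output (one future 3D track per query point), together with a generic flow-matching-style sampler that Euler-integrates a velocity field from noise. Everything named here is an assumption for illustration: `ForecastQuery`, `sample_trajectories`, `toy_velocity_fn`, the array shapes, the Gaussian initialization, and the Euler integrator are not taken from the paper or its code release. The toy velocity field is a stand-in for a learned network and ignores both the frames and the instruction.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class ForecastQuery:
    """Hypothetical container for one forecasting query (illustrative names)."""
    history_frames: np.ndarray  # (H, height, width, 3): short visual history
    query_points: np.ndarray    # (N, 3): 3D points on the object, world coordinates
    instruction: str            # language description of the intended goal


def sample_trajectories(
    query: ForecastQuery,
    velocity_fn: Callable[[np.ndarray, float, ForecastQuery], np.ndarray],
    horizon: int,
    num_steps: int = 10,
    seed: int = 0,
) -> np.ndarray:
    """Generic flow-matching-style sampling (an assumption, not the paper's recipe).

    Starts from Gaussian noise at t=0 and Euler-integrates the velocity field
    up to t=1. Returns an array of shape (N, horizon, 3).
    """
    rng = np.random.default_rng(seed)
    n_points = query.query_points.shape[0]
    x = rng.standard_normal((n_points, horizon, 3))
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        x = x + dt * velocity_fn(x, t, query)
    return x


def toy_velocity_fn(x: np.ndarray, t: float, query: ForecastQuery) -> np.ndarray:
    """Toy stand-in for a learned network.

    Steers every sample toward a fixed straight-line track that rises 0.05
    units along +z per future step. For a straight interpolation path toward
    a fixed target, the matching velocity is (target - x) / (1 - t).
    """
    horizon = x.shape[1]
    steps = np.arange(1, horizon + 1)[None, :, None]  # (1, horizon, 1)
    target = query.query_points[:, None, :] + steps * np.array([0.0, 0.0, 0.05])
    return (target - x) / (1.0 - t)


query = ForecastQuery(
    history_frames=np.zeros((4, 64, 64, 3), dtype=np.uint8),
    query_points=np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]),
    instruction="lift the cup",
)
tracks = sample_trajectories(query, toy_velocity_fn, horizon=8)
print(tracks.shape)  # (2, 8, 3)
```

Because the last Euler step is taken at t = 1 - dt, the toy field lands on its target track up to floating-point error, which makes the sampler easy to check. A real model would replace `toy_velocity_fn` with a network conditioned on the frames and the instruction; the autoregressive variant mentioned in the abstract would instead emit coordinates step by step and is not shown here.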

Get this paper in your agent:

hf papers read 2606.18558
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 2

Datasets citing this paper 2

Spaces citing this paper 0

Collections including this paper 1
