Hugging Face Daily Papers · June 23, 2026 · 4 min read

Go-with-the-Track: Video Compositing and Motion Control with Point Tracking

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


arxiv:2606.20891

Published on Jun 18 · Submitted by Koichi Namekata on Jun 23

Eyeline Labs

Authors: Koichi Namekata, Yash Kant, Zhizheng Liu, Ryan D Burgert, Yuancheng Xu, Kuan Heng Lin, Emmett Steven, Julien Philip, Li Ma, Andrea Vedaldi, Paul Debevec, Ning Yu

Abstract

Go-with-the-Track unifies motion control and reference image compositing in video generation by using point-track embeddings with spatial-aware encoding and video diffusion transformers.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Filmmaking demands precise motion control and reference image compositing, two capabilities that existing methods treat separately. Point-track-conditioned image-to-video models restrict content insertion to the first frame, while reference-to-video models lack fine-grained spatio-temporal control over how reference content integrates across frames.

We present Go-with-the-Track, which unifies both capabilities by jointly conditioning on multiple reference images and reference-anchored point-tracks. These extend conventional point-tracks to explicitly establish correspondences between generated frames and reference images, enabling precise compositing and motion control throughout the video.

To achieve this, we introduce spatially-aware point-track embeddings that encode the full sequence of point-track coordinates using a coordinate-wise MLP followed by temporal pooling. This representation captures the spatial characteristics of each point-track (serving as a unique identifier), while the embedding similarity correlates directly with spatial proximity, enhancing the model's ability to distinguish and associate point-tracks. We inject these point-track embeddings into a video diffusion transformer via a lightweight adapter, resolving the pixel-to-patch resolution mismatch while avoiding the substantial motion detail loss inherent in naive point-track subsampling.

We use a hybrid training strategy that trains jointly on dynamic, static, and synthetic scene video datasets to boost motion controllability. Experiments demonstrate that Go-with-the-Track achieves superior motion and reference control in a single model and enables new capabilities: multi-reference conditioned video generation with point-track driven compositing, as well as camera control for both static and dynamic scenes.

Project Page: https://eyeline-labs.github.io/Go-with-the-Track/
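To make the "coordinate-wise MLP followed by temporal pooling" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the two-layer tanh MLP, the layer sizes, the random untrained weights, the use of mean pooling, and the names `make_mlp`, `mlp_forward`, and `embed_track` are all assumptions made for illustration. Coordinates are assumed to be normalized to [0, 1].

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Random weights for a two-layer MLP (untrained; illustration only)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    scale = 1 / math.sqrt(hidden_dim)
    w2 = [[rng.gauss(0, scale) for _ in range(hidden_dim)] for _ in range(out_dim)]
    b2 = [0.0] * out_dim
    return w1, b1, w2, b2


def mlp_forward(params, xy):
    """Apply the MLP to a single (x, y) coordinate."""
    w1, b1, w2, b2 = params
    hidden = [math.tanh(sum(w * v for w, v in zip(row, xy)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def embed_track(params, track):
    """Embed one point-track: MLP per frame, then mean-pool over time.

    Mean pooling is an assumption; the abstract only says "temporal pooling".
    """
    per_frame = [mlp_forward(params, xy) for xy in track]
    n_frames = len(per_frame)
    return [sum(column) / n_frames for column in zip(*per_frame)]


def l2(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


params = make_mlp(in_dim=2, hidden_dim=16, out_dim=8)

# Three hypothetical 12-frame tracks: A and B start close together, C is far away.
track_a = [(0.10 + 0.01 * t, 0.20) for t in range(12)]
track_b = [(0.11 + 0.01 * t, 0.21) for t in range(12)]
track_c = [(0.80 - 0.01 * t, 0.90) for t in range(12)]

emb_a = embed_track(params, track_a)
emb_b = embed_track(params, track_b)
emb_c = embed_track(params, track_c)

print("distance A-B:", l2(emb_a, emb_b))
print("distance A-C:", l2(emb_a, emb_c))
```

Because the MLP is a smooth function of the coordinates, spatially close tracks map to nearby embeddings even with random weights, which is the property the abstract describes as embedding similarity correlating with spatial proximity. Each track becomes one fixed-size vector regardless of its length, which is what would let an adapter attach it to a transformer token without subsampling the coordinates.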

GitHub: https://github.com/Eyeline-Labs/Go-with-the-Track

Community

kmcode

Paper submitter about 8 hours ago

SIGGRAPH 2026


Get this paper in your agent:

hf papers read 2606.20891

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
