Authors: Jisu Nam, Jahyeok Koo, Soowon Son, Jaewoo Jung, Honggyu An, Junhwa Hur, Seungryong Kim
Project page: https://cvlab-kaist.github.io/TrackCraft3r
Code: https://github.com/cvlab-kaist/TrackCraft3r
arXiv: arxiv.org/abs/2605.12587
TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
Published on May 12
Submitted by frog on May 14

Abstract
Dense 3D tracking from monocular video is fundamental to dynamic scene understanding. While recent 3D foundation models provide reliable per-frame geometry, recovering object motion in this geometry remains challenging and benefits from strong motion priors learned from real-world videos. Existing 3D trackers either follow iterative paradigms trained from scratch on synthetic data or fine-tune 3D reconstruction models learned from static multi-view images, both lacking real-world motion priors. Pre-trained video diffusion transformers (video DiTs) offer rich spatio-temporal priors from internet-scale videos, making them a promising foundation for 3D tracking. However, their frame-anchored formulation, which generates each frame's content, is fundamentally mismatched with reference-anchored dense 3D tracking, which must follow the same physical points from a reference frame across time. We present TrackCraft3R, the first method to repurpose a video DiT as a feed-forward dense 3D tracker. Given a monocular video and its frame-anchored reconstruction pointmap, TrackCraft3R predicts a reference-anchored tracking pointmap that follows every pixel of the first frame across time in a single forward pass, along with its visibility. We achieve this through two designs: (i) a dual-latent representation that uses per-frame geometry latents and reference-anchored track latents as dense queries, and (ii) temporal RoPE alignment, which specifies the target timestamp of each track latent. Together, these designs convert the per-frame generative paradigm of video DiTs into a reference-anchored tracking formulation with LoRA fine-tuning. TrackCraft3R achieves state-of-the-art performance on standard sparse and dense 3D tracking benchmarks, while running 1.3× faster and using 4.6× less peak memory than the strongest prior method. We further demonstrate robustness to large motions and long videos.

AI-generated summary
TrackCraft3R enables efficient dense 3D tracking from monocular video by adapting video diffusion transformers to follow physical points across frames using dual-latent representation and temporal RoPE alignment.
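The temporal RoPE alignment described in the abstract builds on standard rotary position embeddings: each reference-anchored track latent is assigned the timestamp of the frame it should predict, rather than the timestamp of the frame it was derived from. The following is a hedged sketch of that idea using a generic 1D RoPE; the `rope` function, latent shapes, and position assignment are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply a generic 1D rotary position embedding (RoPE).

    x:         (n, d) array with d even
    positions: (n,) position indices -- here, frame timestamps
    """
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)        # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]     # (n, half) rotation angles
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1_i, x2_i) feature pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Illustrative use: frame-anchored geometry latents carry their own frame
# index, while each reference-anchored track latent (derived from frame 0)
# is assigned the *target* timestamp it should predict, so attention sees
# it at the same temporal position as that frame's geometry latent.
T, d = 4, 8                                    # hypothetical sizes
geometry_latents = np.random.randn(T, d)       # one latent per frame
track_latents = np.random.randn(T, d)          # queries, all from frame 0

geometry_rot = rope(geometry_latents, np.arange(T, dtype=float))
track_rot = rope(track_latents, np.arange(T, dtype=float))  # target timestamps
print(geometry_rot.shape, track_rot.shape)     # (4, 8) (4, 8)
```

Because RoPE rotations at equal positions agree exactly, a track latent tagged with timestamp t attends to frame t's geometry latent with zero relative rotation, which is presumably what lets the model route each dense query to the right frame.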