Hugging Face Daily Papers · June 16, 2026 · 4 min read

Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.15534

Published on Jun 14 · Submitted by Feng Qiao on Jun 16 · Multimodal Vision Research Laboratory @ WashU

Authors:

Feng Qiao, Zhaochong An, Zhexiao Xiong, Serge Belongie, Nathan Jacobs

Abstract

Track2View generates novel camera viewpoints from videos by using 3D point tracks to establish explicit spatiotemporal correspondences, achieving superior visual quality and camera accuracy compared to existing methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicit learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from source to target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. Project page: https://qjizhi.github.io/track2view
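To make the idea of a "paired 3D point track" concrete, here is a minimal sketch of projecting one moving scene point into a source and a target camera at every frame. It is an illustration only, not the authors' implementation: the abstract does not specify the camera model or data layout, so the pinhole model, the visibility rule, and the names `Camera`, `project`, and `paired_tracks` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Optional[Tuple[float, float]]  # None marks "not visible in this view"


@dataclass
class Camera:
    """Hypothetical pinhole camera: x_cam = R @ x_world + t (assumed convention)."""
    R: Tuple[Vec3, Vec3, Vec3]  # world-to-camera rotation, row-major
    t: Vec3                     # world-to-camera translation
    focal: float                # focal length in pixels
    cx: float                   # principal point, x
    cy: float                   # principal point, y
    width: int
    height: int

    def project(self, p: Vec3) -> Pixel:
        # Move the world point into camera coordinates.
        x, y, z = (sum(row[i] * p[i] for i in range(3)) + ti
                   for row, ti in zip(self.R, self.t))
        if z <= 1e-6:  # at or behind the camera plane
            return None
        u = self.focal * x / z + self.cx
        v = self.focal * y / z + self.cy
        if not (0 <= u < self.width and 0 <= v < self.height):
            return None  # falls outside the image
        return (u, v)


def paired_tracks(
    tracks_3d: List[List[Vec3]],
    src_cams: List[Camera],
    tgt_cams: List[Camera],
) -> List[List[Tuple[Pixel, Pixel]]]:
    """For each track and frame, return (source pixel, target pixel).

    tracks_3d[n][f] is the world position of point n at frame f;
    src_cams[f] and tgt_cams[f] are the two cameras at frame f.
    """
    return [
        [(src_cams[f].project(p), tgt_cams[f].project(p))
         for f, p in enumerate(track)]
        for track in tracks_3d
    ]


# Toy scene: one point drifting to the right, seen by two cameras that
# differ only by a one-unit sideways offset (illustrative values).
IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
src = Camera(IDENTITY, (0.0, 0.0, 0.0), 100.0, 64.0, 48.0, 128, 96)
tgt = Camera(IDENTITY, (-1.0, 0.0, 0.0), 100.0, 64.0, 48.0, 128, 96)
track = [(0.0, 0.0, 4.0), (0.5, 0.0, 4.0)]

pairs = paired_tracks([track], [src, src], [tgt, tgt])
for frame, (uv_src, uv_tgt) in enumerate(pairs[0]):
    print(frame, uv_src, uv_tgt)
```

Each entry ties a source pixel to a target pixel at the same timestep, which is the kind of explicit, per-frame correspondence the abstract describes. How Track2View encodes these pairs for the diffusion transformer is not covered by this sketch.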

Get this paper in your agent:

hf papers read 2606.15534

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
