Fast 4D Mesh Generation by Spatio-Temporal Attention Chains
Authors: Dvir Samuel, Yuval Atzmon, Gal Chechik, Yoni Kasten (NVIDIA)
Published on May 19, 2026
· Submitted by Samuel on May 20
Project page: https://research.nvidia.com/labs/par/fast4dmesh/
Abstract
A training-free 4D mesh generation approach uses spatio-temporal attention chains to accelerate mesh creation while improving temporal correspondence quality and enabling scalable long-sequence processing.
AI-generated summary
4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework, which we call the Spatio-Temporal Attention Chain, that propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens. It then follows temporal correspondences in latent space and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details, thereby improving dynamic mesh geometry and temporal consistency.
Compared to the state of the art, our method generates a 4D mesh in 9 seconds, achieving a 13× speedup while producing higher-quality results. Moreover, our approach scales to videos up to 16× longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.
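The three steps of the chain described in the abstract (vertex to latent, latent to latent across time, latent to vertex) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the feature arrays, the scaled dot-product softmax, and the names `softmax_attention` and `attention_chain` are assumptions chosen for illustration only. The sketch composes three row-stochastic attention maps and uses the result to carry anchor-mesh vertices to positions in another frame.

```python
import numpy as np

def softmax_attention(queries, keys, temperature=1.0):
    """Row-stochastic attention weights between two sets of feature vectors."""
    logits = queries @ keys.T / (temperature * np.sqrt(queries.shape[1]))
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def attention_chain(anchor_vertex_feats, anchor_latents,
                    frame_latents, frame_vertex_feats, frame_vertex_pos):
    """Hypothetical chain: anchor vertices -> anchor latents -> frame latents
    -> frame vertices. Returns per-anchor-vertex positions in the target frame."""
    v2l = softmax_attention(anchor_vertex_feats, anchor_latents)  # (V, N)
    l2l = softmax_attention(anchor_latents, frame_latents)        # (N, N)
    l2v = softmax_attention(frame_latents, frame_vertex_feats)    # (N, M)
    chain = v2l @ l2l @ l2v                                       # (V, M)
    # Each anchor vertex becomes a convex combination of frame vertices,
    # so the anchor topology (vertex count and faces) is kept unchanged.
    return chain @ frame_vertex_pos, chain

# Toy usage with random features (all sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
V, N, M, D = 5, 4, 6, 8  # anchor vertices, latent tokens, frame vertices, feature dim
anchor_vertex_feats = rng.normal(size=(V, D))
anchor_latents = rng.normal(size=(N, D))
frame_latents = rng.normal(size=(N, D))
frame_vertex_feats = rng.normal(size=(M, D))
frame_vertex_pos = rng.normal(size=(M, 3))

positions, chain = attention_chain(anchor_vertex_feats, anchor_latents,
                                   frame_latents, frame_vertex_feats,
                                   frame_vertex_pos)
print(positions.shape)
```

Because a product of row-stochastic matrices is itself row-stochastic, the chained map needs no explicit vertex-to-vertex matching step, which is the property the abstract attributes to the design. In the actual method the attention maps would come from the 4D backbone rather than from random features.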
Community
A fast method for generating a 4D mesh from video. It takes 9 seconds (13× faster than prior work) to generate a topology-consistent 4D mesh from a 16-frame video. Our approach also scales to videos up to 16× longer without degrading mesh quality. It keeps the mesh grounded to the input video, which allows downstream 2D/4D tracking, camera estimation, and 4D object placement.
Cite arxiv.org/abs/2605.19786 in a model, dataset, or Space README.md to link it from this page.