arXiv PDF: https://arxiv.org/pdf/2605.15186
VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction
Authors: Kaixin Zhu, Yiwen Tang, Yifan Yang, Renrui Zhang, Bohan Zeng, Ziyu Guo, Ruichuan An, Zhou Liu, Qizhi Chen, Delin Qu, Jaehong Yoon, Wentao Zhang
AI-generated summary
VGGT-Edit enables text-conditioned 3D scene editing through depth-synchronized text injection and direct geometric displacement prediction, achieving superior quality and efficiency over 2D-lifting approaches.
Abstract
High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed.
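The residual-field idea in the abstract, predicting per-point 3D displacements that are added to the backbone's geometry rather than regenerating the scene, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the module names, feature and text-embedding dimensions, and the simple concatenation-based conditioning below are all assumptions, and the paper's depth-synchronized text injection and transformer head are not reproduced here. The zero-initialized output layer shows one way such a head can start as an identity mapping, consistent with the stated goal of preserving background stability.

```python
# Minimal sketch (not the authors' code): a residual displacement head that,
# given per-point features from a feed-forward reconstruction backbone and a
# text-instruction embedding, predicts a 3D offset per point. Module names,
# dimensions, and the conditioning scheme are illustrative assumptions.
import torch
import torch.nn as nn


class ResidualDisplacementHead(nn.Module):
    def __init__(self, feat_dim=256, text_dim=512, hidden_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)  # project instruction embedding
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)  # project per-point backbone features
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 3),  # per-point residual (dx, dy, dz)
        )
        # Zero-init the final layer so the head initially predicts no displacement,
        # i.e. the edited scene starts identical to the reconstructed one.
        nn.init.zeros_(self.mlp[-1].weight)
        nn.init.zeros_(self.mlp[-1].bias)

    def forward(self, point_feats, text_emb):
        # point_feats: (B, N, feat_dim) features for N scene points
        # text_emb:    (B, text_dim) embedding of the editing instruction
        t = self.text_proj(text_emb).unsqueeze(1).expand(-1, point_feats.shape[1], -1)
        f = self.feat_proj(point_feats)
        return self.mlp(torch.cat([f, t], dim=-1))  # (B, N, 3) displacements


# Usage sketch: apply the predicted residual field on top of the backbone's geometry.
# edited_points = base_points + head(point_feats, text_emb)
```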