PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory
Authors: Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu
Published: 2026-06-15
arXiv: 2606.16449
Project page: https://ys-imtech.github.io/projects/PermaVid/
Code: https://github.com/YS-IMTech/PermaVid
Abstract
PermaVid addresses long-term video consistency after edits by using multi-modal memory banks that separate appearance and geometric structure, enabling coherent video generation across time and viewpoints.
Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.
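The abstract describes the two memory banks and the edit-aware update only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. Every name in it (`DualContextMemory`, `Entry`, `apply_edit`, `retrieve`), the use of camera position as the retrieval key, and the invalidation rule (an appearance-only edit drops nearby RGB contexts but keeps depth contexts, while a layout edit drops both) are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from math import dist

# Hypothetical retrieval key: a 3D camera position. The paper may key memory differently.
Pose = tuple[float, float, float]


@dataclass
class Entry:
    pose: Pose      # where the context was observed
    step: int       # generation step at which it was stored
    payload: str    # stand-in for a stored frame or feature map


@dataclass
class DualContextMemory:
    """Toy memory with separate appearance (RGB) and geometry (depth) banks."""
    rgb: list[Entry] = field(default_factory=list)
    depth: list[Entry] = field(default_factory=list)

    def write(self, pose: Pose, step: int, rgb_payload: str, depth_payload: str) -> None:
        # Each observation contributes one entry to each bank.
        self.rgb.append(Entry(pose, step, rgb_payload))
        self.depth.append(Entry(pose, step, depth_payload))

    def apply_edit(self, center: Pose, radius: float, changes_layout: bool) -> None:
        # Assumed rule: entries observed within `radius` of the edit are outdated.
        def outside(entry: Entry) -> bool:
            return dist(entry.pose, center) > radius

        # Appearance is always affected by an edit in this sketch.
        self.rgb = [e for e in self.rgb if outside(e)]
        # Geometry is only invalidated when the edit changes the scene layout.
        if changes_layout:
            self.depth = [e for e in self.depth if outside(e)]

    def retrieve(self, pose: Pose, k: int = 2) -> dict[str, list[Entry]]:
        # Nearest entries first; ties are broken in favor of the most recent step.
        def nearest(bank: list[Entry]) -> list[Entry]:
            return sorted(bank, key=lambda e: (dist(e.pose, pose), -e.step))[:k]

        # Mixed-modality reference set: contexts from both banks.
        return {"rgb": nearest(self.rgb), "depth": nearest(self.depth)}
```

In this toy setup, an appearance-only edit leaves the depth bank intact, so a later retrieval near the edited region can still return geometric references even when no valid RGB reference remains there. How PermaVid actually detects edits, updates memory, and fuses the retrieved contexts is not specified in the abstract.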