Hugging Face Daily Papers · 5 min read

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu
Project page: https://ys-imtech.github.io/projects/PermaVid/
Code: https://github.com/YS-IMTech/PermaVid
arxiv:2606.16449

Published on Jun 15 · Submitted by Shuai Yang on Jun 16

Abstract

PermaVid addresses long-term video consistency after edits by using multi-modal memory banks that separate appearance and geometric structure, enabling coherent video generation across time and viewpoints.

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.
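The abstract does not spell out how the two memory banks are updated or queried, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes a hypothetical rule in which an appearance edit invalidates nearby RGB entries but leaves depth entries usable, while a layout edit invalidates both. The names `Entry`, `DualContextMemory`, `apply_edit`, and `retrieve`, the 2D camera positions, and the string payloads are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from math import dist


@dataclass
class Entry:
    position: tuple[float, float]  # hypothetical 2D camera position
    payload: str                   # stand-in for an RGB frame or a depth map
    step: int                      # generation step at which it was stored
    valid: bool = True             # cleared when an edit makes it stale


@dataclass
class DualContextMemory:
    rgb: list[Entry] = field(default_factory=list)    # appearance bank
    depth: list[Entry] = field(default_factory=list)  # geometry-only bank

    def store(self, position, rgb_payload, depth_payload, step):
        """Store one observation in both banks."""
        self.rgb.append(Entry(position, rgb_payload, step))
        self.depth.append(Entry(position, depth_payload, step))

    def apply_edit(self, center, radius, kind):
        """Assumed rule: 'appearance' edits stale RGB only; other edits stale both."""
        banks = [self.rgb] if kind == "appearance" else [self.rgb, self.depth]
        for bank in banks:
            for entry in bank:
                if dist(entry.position, center) <= radius:
                    entry.valid = False

    def retrieve(self, position, k=1):
        """Return the k nearest still-valid entries from each bank."""
        def nearest(bank):
            live = [e for e in bank if e.valid]
            return sorted(live, key=lambda e: dist(e.position, position))[:k]
        return {"rgb": nearest(self.rgb), "depth": nearest(self.depth)}


memory = DualContextMemory()
memory.store((0.0, 0.0), "rgb@A", "depth@A", step=0)
memory.store((5.0, 0.0), "rgb@B", "depth@B", step=1)

# Repaint the scene around viewpoint A: its RGB entry becomes stale,
# but its depth entry still describes the unchanged geometry.
memory.apply_edit(center=(0.0, 0.0), radius=1.0, kind="appearance")
refs = memory.retrieve((0.5, 0.0))
```

In this toy run the query near viewpoint A gets its depth reference from A and its RGB reference from the farther viewpoint B, which illustrates why references drawn from mixed-modality memory can remain useful after an edit.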


Get this paper in your agent:

hf papers read 2606.16449
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
