Hugging Face Daily Papers · June 23, 2026

Vera: A Layered Diffusion Model for Content-Preserving Video Editing

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

We present Vera, a layered video diffusion model. Vera generates only what needs to change as separate edit layers while leaving the rest of the video untouched, preserving the identities, performances, and other details from the source footage exactly as filmed.

arxiv:2606.23610

Published on Jun 22 · Submitted by Zhuoning Yuan on Jun 23 · Netflix

Authors: Hongkai Zheng, Ta-Ying Cheng, Benjamin Klein, Yisong Yue, Zhuoning Yuan

Abstract

Vera is a layered diffusion framework that preserves video content during editing by generating edit layers and alpha mattes through a Mixture-of-Transformers architecture.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Video diffusion models have enabled remarkable progress in video generation and editing. However, content preservation remains a core challenge: existing methods regenerate every pixel and often alter elements that should remain unchanged, such as characters or background scenes. We introduce Vera, a layered diffusion framework for content-preserving video editing. Instead of regenerating the entire video, Vera generates an edit layer along with an alpha matte for compositing with the source video, separating creative editing from content preservation by design. To encourage coherent composition with the source video, we extend the text-to-video DiT into a Mixture-of-Transformers (MoT) architecture, with separate DiTs for each layer that interact through joint self-attention. To support the training of Vera, we further construct a high-quality layered dataset with accurate alpha mattes, diverse scenes and dynamics, and visual effects. Across our quantitative benchmark and human preference study, Vera outperforms leading open-source video editing models in content preservation while remaining competitive in edit quality, using 486K frames of layered training data.
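
The compositing step described in the abstract can be illustrated with standard alpha blending. The sketch below is not the authors' implementation: it assumes straight (non-premultiplied) alpha, float frames in [0, 1], and a hypothetical helper name `composite`. It shows why pixels with zero alpha come through from the source unchanged.

```python
import numpy as np

def composite(source, edit_layer, alpha):
    """Blend an edit layer over source frames with a per-pixel alpha matte.

    Assumed shapes (illustrative, not taken from the paper):
      source, edit_layer: (T, H, W, 3) floats in [0, 1]
      alpha:              (T, H, W, 1) floats in [0, 1]
    """
    source = np.asarray(source, dtype=np.float32)
    edit_layer = np.asarray(edit_layer, dtype=np.float32)
    alpha = np.clip(np.asarray(alpha, dtype=np.float32), 0.0, 1.0)
    # Straight-alpha "over" blend: alpha = 0 keeps the source pixel,
    # alpha = 1 takes the edit-layer pixel.
    return alpha * edit_layer + (1.0 - alpha) * source

# Toy clip: 2 frames of 4x4 pixels with random source content.
T, H, W = 2, 4, 4
rng = np.random.default_rng(0)
source = rng.random((T, H, W, 3), dtype=np.float32)

# Hypothetical edit: a white 2x2 patch in the center of every frame.
edit_layer = np.ones((T, H, W, 3), dtype=np.float32)
alpha = np.zeros((T, H, W, 1), dtype=np.float32)
alpha[:, 1:3, 1:3, :] = 1.0

out = composite(source, edit_layer, alpha)

# Pixels outside the matte are identical to the source.
untouched = alpha[..., 0] == 0
assert np.array_equal(out[untouched], source[untouched])
```

Under this formulation, content preservation outside the matte holds by construction, since the blend never reads the edit layer where alpha is zero. How the model treats soft matte edges or premultiplied alpha is not stated in the abstract.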

Community

yzhuoning (paper author and submitter), about 10 hours ago:

Learn more from our project website: https://vera-layered-diffusion.github.io/.

Get this paper in your agent:

hf papers read 2606.23610

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
