Hugging Face Daily Papers · June 2, 2026 · 6 min read

LongLive-RAG: A General Retrieval-Augmented Framework for Long Video Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Qixin Hu, Shuai Yang, Wei Huang, Song Han, Yukang Chen

Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away.

We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss that suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce error accumulation caused by sliding-window attention.

Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory.

Code is available at https://github.com/qixinhu11/LongLive-RAG.
Project page: http://longlive-rag.github.io/
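
The retrieval step described in the abstract (a query embedding used to look up relevant historical latents, which are then combined with the recent window) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how embeddings are computed, which similarity measure is used, or how many latents are retrieved. Cosine similarity, the `LatentHistory` class, and the `window` and `k` parameters are all assumptions made for illustration, and plain strings stand in for latent tensors.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


@dataclass
class LatentHistory:
    """Hypothetical searchable memory of (embedding, latent) pairs, one per block."""
    window: int = 2  # assumed number of recent blocks the generator already sees
    embeddings: list = field(default_factory=list)
    latents: list = field(default_factory=list)

    def append(self, embedding, latent):
        self.embeddings.append(embedding)
        self.latents.append(latent)

    def retrieve(self, query, k=2):
        """Indices of the top-k blocks older than the recent window, best match first."""
        n_old = max(0, len(self.embeddings) - self.window)
        scored = [(cosine(query, self.embeddings[i]), i) for i in range(n_old)]
        scored.sort(key=lambda s: (-s[0], s[1]))  # ties go to the older block
        return [i for _, i in scored[:k]]

    def context(self, query, k=2):
        """Retrieved latents in chronological order, followed by the recent window."""
        picked = sorted(self.retrieve(query, k))
        start = max(0, len(self.latents) - self.window)
        recent = list(range(start, len(self.latents)))
        return [self.latents[i] for i in picked + recent]


# Toy usage: five generated blocks with 2-D embeddings (made-up values).
history = LatentHistory(window=2)
toy_blocks = [
    ([1.0, 0.0], "block0"),
    ([0.0, 1.0], "block1"),
    ([0.9, 0.1], "block2"),
    ([0.0, 1.0], "block3"),
    ([0.1, 0.9], "block4"),
]
for emb, lat in toy_blocks:
    history.append(emb, lat)

# The query resembles block0 and block2, both of which lie outside the recent window.
print(history.context([1.0, 0.0], k=2))
# -> ['block0', 'block2', 'block3', 'block4']
```

In this sketch the generator would condition on the two retrieved older blocks in addition to the last two blocks, which is the non-local context the abstract contrasts with pure sliding-window conditioning. A brute-force scan is used for clarity; the abstract only states that retrieval is lightweight relative to generation.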

arxiv:2606.02553

Published on Jun 1

Submitted by Wei Huang on Jun 2

NVIDIA

Abstract

AI-generated summary: LongLive-RAG addresses long-video generation challenges by using retrieval-augmented generation to overcome error accumulation from sliding-window attention, enabling better temporal coherence and quality.

Get this paper in your agent:

hf papers read 2606.02553

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1
