Hugging Face Daily Papers · 7 min read

AdaState: Self-Evolving Anchors for Streaming Video Generation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.30349

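Published on May 28, 2026 · Submitted by Yusuf Dalva on May 29
Authors: Yusuf Dalva, Pinar Yanardag
Organization: Virginia Tech
Project page: https://adastate.github.io/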

Abstract

AI-generated summary: Video diffusion models that replace the static first-frame anchor with an adaptive state generate more dynamic videos. The scene reference evolves with the video instead of staying fixed to the initial frame, and recurrent denoising acts as the transition function.

Autoregressive video diffusion models generate streaming video by producing frames sequentially, conditioning each chunk on previously generated content. These models are structurally anchored to the first frame: its key-value representation occupies a privileged position in the attention cache and serves as the primary scene reference throughout generation. As the cleanest and most error-free position in the cache, this anchor draws disproportionate attention, suppressing video dynamics, and locking scene composition to the initial viewpoint even as the scene naturally evolves. The result is a temporally shallow video in which motion, camera movement, and scene progression are dampened in favor of static consistency. To address this, we replace the static anchor with an adaptive state, a hidden latent that the model denoises alongside content at every chunk but never renders. Rather than referencing a frozen first frame, the model generates its own scene anchor at each step by attending to both the previous state and the current content, producing a reference that evolves with the generated content. Unlike standard video generation, which encodes an absolute notion of time, our formulation treats time as relative: every generation step sees the same positional structure regardless of how far generation has progressed, and the state transition is identical at every chunk. Together, these properties introduce a recurrence into the generation process, where denoising serves as the transition function, and the KV cache serves as the carrier, requiring no external module. Experiments demonstrate that the adaptive state substantially improves video dynamics, enabling richer motion and natural scene progression within generated videos.
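
The recurrence described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoise`, `rollout`, the blending rule, and every size below are hypothetical stand-ins chosen only to show the data flow. A hidden state is denoised jointly with each chunk, only the chunk is kept as output, and the updated state becomes the cached reference for the next chunk. For brevity the cache here holds only the state; a real model would presumably also cache recent content.

```python
import numpy as np


def attention(q, k, v):
    """Plain scaled dot-product attention (single head, no masking)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def toy_denoise(tokens, cache, steps=4):
    """Hypothetical stand-in for the denoiser.

    Each step blends the noisy tokens with an attention readout over
    [cache; tokens], so state and chunk tokens see the previous state
    and each other.
    """
    for _ in range(steps):
        context = np.concatenate([cache, tokens], axis=0)
        tokens = 0.5 * tokens + 0.5 * attention(tokens, context, context)
    return tokens


def rollout(num_chunks, state_len=2, chunk_len=4, dim=8, seed=0):
    """Generate chunks one at a time while carrying a hidden state."""
    rng = np.random.default_rng(seed)
    state = rng.standard_normal((state_len, dim))  # initial anchor (assumed)
    frames = []
    for _ in range(num_chunks):
        # State and chunk start from noise and are denoised together.
        noisy = rng.standard_normal((state_len + chunk_len, dim))
        clean = toy_denoise(noisy, cache=state)
        # The same transition is applied at every chunk.
        state, chunk = clean[:state_len], clean[state_len:]
        frames.append(chunk)  # only the chunk is kept; the state is never rendered
    return np.concatenate(frames, axis=0), state


frames, final_state = rollout(num_chunks=3)
print(frames.shape, final_state.shape)
```

The point of the sketch is structural: the loop body is identical at every chunk, and the only thing carried across iterations is the state held in the cache, which matches the description of denoising as the transition function and the KV cache as the carrier.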

Community

Yusuf Dalva (paper author, paper submitter) · 1 day ago

Streaming video diffusion models have a structural blind spot: they anchor on the first frame. Because that frame sits in the cleanest, most error-free slot of the KV cache, attention collapses onto this reference, suppressing dynamics and locking the scene composition even as the rollout progresses.

AdaState replaces this static anchor with a self-evolving one. We reserve a hidden latent slot inside the KV cache that the model denoises alongside each chunk but never renders as a frame. At every step, the model generates its own scene anchor by attending to both the previous state and the current content, so the reference evolves with the video and stays temporally continuous with the chunk being generated.
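
One way to picture the reserved slot is as a cache layout in which the anchor is overwritten at every chunk instead of being pinned to the first frame. The sketch below is an assumption-laden illustration, not the paper's data structure: `AnchoredCache`, the sliding `window` of recent chunks, and the way positions are assigned are all hypothetical. It also shows one reading of the relative-time idea from the abstract: position indices restart from zero at every step, so once the window is full each step sees the same positional layout.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class AnchoredCache:
    """Toy cache: one reserved state slot plus a sliding window of chunks."""

    window: int = 3                          # hypothetical number of cached chunks
    state: Optional[np.ndarray] = None       # hidden slot, never rendered
    chunks: deque = field(default_factory=deque)

    def update(self, new_state, new_chunk):
        """Store the freshly denoised state and chunk."""
        self.state = new_state               # the anchor is replaced, not frozen
        self.chunks.append(new_chunk)
        if len(self.chunks) > self.window:
            self.chunks.popleft()            # oldest content leaves the window

    def context(self):
        """Tokens visible to the next chunk (call after at least one update).

        Positions are relative: they restart at 0 on every call and do
        not depend on how many chunks have been generated so far.
        """
        tokens = np.concatenate([self.state, *self.chunks], axis=0)
        positions = np.arange(len(tokens))
        return tokens, positions


cache = AnchoredCache(window=2)
for t in range(5):
    # Fill state and chunk with the step index so their age is easy to read.
    cache.update(np.full((1, 4), float(t)), np.full((3, 4), float(t)))
tokens, positions = cache.context()
print(tokens[:, 0], positions)
```

In this layout the first row always holds the most recent state, so the reference stays temporally adjacent to the chunk being generated rather than drifting further into the past as the rollout grows.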

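Librarian Bot (automated message): the following similar papers were recommended by the Semantic Scholar API.

* VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion (2026), https://huggingface.co/papers/2605.30351
* CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives (2026), https://huggingface.co/papers/2605.12496
* Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis (2026), https://huggingface.co/papers/2604.06939
* DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation (2026), https://huggingface.co/papers/2605.21028
* Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity (2026), https://huggingface.co/papers/2605.14487
* RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO (2026), https://huggingface.co/papers/2605.15190
* StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration (2026), https://huggingface.co/papers/2605.25659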

Get this paper in your agent:

hf papers read 2605.30349
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
