Hugging Face Daily Papers · 7 min read

LVSA: Training-Free Sparse Attention for Long Video Diffusion

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Code: https://github.com/JiusiServe/LongVideoSparseAttention
arxiv:2605.31057


Published on May 29 · Submitted by Gaël Glorian on Jun 2
Authors: Gaël Glorian, Ioannis Lamprou, Zhen Zhang, Yujie Yuan, Hongsheng Liu

Abstract

AI-generated summary: Long Video Sparse Attention (LVSA) addresses computational bottlenecks in video diffusion models by introducing a sparse attention mechanism that reduces compute costs while maintaining video quality beyond training horizons.

Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State-of-the-art approaches are either too costly (e.g., they require retraining) or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training-free, model-agnostic block-sparse attention for video diffusion transformers that combines a structured window pattern with rotating global anchors, thus removing the fixed-grid bias that causes long-range temporal artifacts. LVSA, combined with a FlashInfer kernel, reduces compute by up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which otherwise runs out of memory on a single GPU. Moreover, LVSA provides speedups of up to 2.41x over RIFLEx and 3.27x over UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups of up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B compared to dense attention. To evaluate quality fairly, we introduce VQeval, a tool that properly scores looping-video failures, which state-of-the-art evaluators such as VBench-Long instead reward. LVSA is quality-neutral for generation at the training horizon length and quality-positive at extended lengths.
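To make the idea of "a structured window pattern with rotating global anchors" more concrete, here is a minimal sketch of a block-level attention mask. It is not the authors' implementation: the abstract does not specify the layout, so the function names (`block_sparse_mask`, `density`), the parameters (`window`, `anchor_stride`, `step`), and the rule that anchor positions shift by one block per step are all assumptions made for illustration.

```python
def block_sparse_mask(num_blocks, window, anchor_stride, step):
    """Build a num_blocks x num_blocks boolean block mask (hypothetical layout).

    mask[q][k] is True when query block q may attend to key block k.
    """
    # Assumption: global anchors sit on a strided grid whose offset shifts
    # with `step`, so no fixed set of key blocks is always the anchor set.
    offset = step % anchor_stride
    anchors = set(range(offset, num_blocks, anchor_stride))

    mask = []
    for q in range(num_blocks):
        # Keep a key block if it is inside the local window or is an anchor.
        row = [abs(q - k) <= window or k in anchors for k in range(num_blocks)]
        mask.append(row)
    return mask


def density(mask):
    """Fraction of (query block, key block) pairs that are computed."""
    total = len(mask) * len(mask[0])
    kept = sum(sum(row) for row in mask)
    return kept / total


if __name__ == "__main__":
    for step in range(3):
        m = block_sparse_mask(num_blocks=48, window=2, anchor_stride=8, step=step)
        d = density(m)
        # 1 / density is only an idealized bound on attention FLOP savings.
        print(f"step={step} density={d:.3f} ideal_reduction={1 / d:.2f}x")
```

Because dense attention computes every block pair, the mask density gives a rough proxy for how much attention work is skipped. It is an idealized figure and should not be read as a reproduction of the speedups reported in the abstract, which depend on the kernel, the model, and the non-attention parts of inference.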


Get this paper in your agent:

hf papers read 2605.31057
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

