Hugging Face Daily Papers · June 2, 2026 · 4 min read

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.30351

Published on May 28

Submitted by Hidir Yesiltepe on Jun 2 · Virginia Tech

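Authors: Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag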

Abstract

AI-generated summary: VideoMLA reduces memory usage in video diffusion models by replacing per-head keys and values with shared low-rank content and decoupled 3D-RoPE positional keys, maintaining quality while achieving significant compression and improved throughput.

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
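
Two quantities in the abstract can be made concrete with a small sketch: the per-token KV footprint of a per-head cache versus a shared latent plus a shared positional key, and a "99%-energy effective rank". The block below is an illustration under stated assumptions, not the authors' implementation. The function names and all dimensions (`num_heads`, `head_dim`, `latent_dim`, `rope_dim`) are hypothetical and are not taken from the paper; the footprint is counted in stored elements rather than bytes; and "energy" is assumed to mean the sum of squared singular values, which may differ from the paper's exact definition.

```python
from itertools import accumulate


def kv_elements_per_token_mha(num_heads: int, head_dim: int) -> int:
    """Per-head cache: one key vector and one value vector for every head."""
    return 2 * num_heads * head_dim


def kv_elements_per_token_mla(latent_dim: int, rope_dim: int) -> int:
    """MLA-style cache: one shared content latent plus one shared positional key."""
    return latent_dim + rope_dim


def memory_reduction(num_heads: int, head_dim: int,
                     latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token KV storage saved by the latent layout."""
    mha = kv_elements_per_token_mha(num_heads, head_dim)
    mla = kv_elements_per_token_mla(latent_dim, rope_dim)
    return 1.0 - mla / mha


def effective_rank(singular_values, energy: float = 0.99) -> int:
    """Smallest k whose top-k squared singular values reach `energy` of the total.

    Assumption: energy is measured on squared singular values.
    """
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        return 0
    for k, running in enumerate(accumulate(squared), start=1):
        if running >= energy * total:
            return k
    return len(squared)


# Hypothetical dimensions chosen for illustration only (NOT from the paper).
reduction = memory_reduction(num_heads=12, head_dim=128, latent_dim=192, rope_dim=32)
print(f"per-token KV reduction: {reduction:.1%}")  # 92.7% for these made-up sizes

# A flat spectrum needs every direction; a peaked spectrum needs only one.
flat_spectrum = [1.0] * 50
peaked_spectrum = [10.0] + [0.1] * 49
print(effective_rank(flat_spectrum), effective_rank(peaked_spectrum))  # 50 1
```

The contrast between the flat and peaked spectra mirrors the abstract's point: a matrix whose effective rank is close to its full dimension cannot be approximated well by truncating its spectrum, so quality retained at a small latent dimension has to come from training within the bottleneck rather than from spectral approximation.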

arXiv: arxiv.org/abs/2605.30351 · Project page: https://videomla.github.io/ · GitHub: https://github.com/yesiltepe-hidir/VideoMLA

Get this paper in your agent:

hf papers read 2605.30351

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
