Hugging Face Daily Papers · 3 min read

Linear Scaling Video VLMs for Long Video Understanding

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

StateKV is an inference-time method that enables linear-time video prefill for video vision-language models by using a fixed-capacity recurrent state, improving efficiency for long video understanding.

Authors: Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles (Stanford University)
arxiv:2605.31598

Published on May 29, 2026 · Submitted by taesiri on Jun 1, 2026

Abstract

AI-generated summary: StateKV enables efficient long-video vision-language model inference by maintaining cross-frame context in a fixed-capacity recurrent state while using a full per-frame cache for decoding, achieving linear-time prefill with minimal accuracy loss compared to full self-attention.

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost, measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
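To make the scaling argument in the abstract concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the `Token` class, the `importance` score, the top-k eviction rule, and every function name are assumptions made for illustration, since the abstract does not specify how importance is computed or how the state is updated. The sketch only shows why a bounded cross-frame state makes prefill cost grow linearly with the number of frames, while a second cache keeps every token for decoding.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    frame: int         # index of the frame this token came from
    index: int         # position of the token inside its frame
    importance: float  # hypothetical importance score (assumed to be given)


@dataclass
class PrefillResult:
    state: list = field(default_factory=list)       # bounded cross-frame state
    full_cache: list = field(default_factory=list)  # every token, kept for decoding
    attended_pairs: int = 0                         # query-key pairs scored in prefill


def recurrent_prefill(frames, capacity):
    """Process frames one at a time while carrying a fixed-capacity state."""
    result = PrefillResult()
    for frame_tokens in frames:
        # Each new token attends to the bounded state plus its own frame,
        # so the per-frame cost does not depend on how many frames came before.
        context = len(result.state) + len(frame_tokens)
        result.attended_pairs += len(frame_tokens) * context
        # The full per-frame cache grows without eviction.
        result.full_cache.extend(frame_tokens)
        # Assumed update rule: merge, then keep the `capacity` highest scores.
        merged = result.state + list(frame_tokens)
        merged.sort(key=lambda t: t.importance, reverse=True)
        result.state = merged[:capacity]
    return result


def full_attention_pairs(num_frames, tokens_per_frame):
    """Query-key pairs when each frame attends to all tokens seen so far."""
    n = tokens_per_frame
    return sum(n * (t * n) for t in range(1, num_frames + 1))


def make_frames(num_frames, tokens_per_frame):
    # Deterministic toy scores; a real system would derive them from the model.
    return [
        [Token(f, i, ((f * 7 + i * 3) % 11) / 10.0) for i in range(tokens_per_frame)]
        for f in range(num_frames)
    ]


if __name__ == "__main__":
    for num_frames in (8, 16, 32):
        res = recurrent_prefill(make_frames(num_frames, 4), capacity=6)
        print(num_frames, res.attended_pairs, full_attention_pairs(num_frames, 4))
```

In this toy setup, once the state is full every additional frame adds the same number of query-key pairs, whereas the full-attention count grows with the square of the frame count. Counting pairs is only a rough proxy for the FLOPs the abstract refers to.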

Community

Paper submitter about 9 hours ago

StateKV is an inference-time method that enables linear-time video prefill for video vision-language models by using a fixed-capacity recurrent state, improving efficiency for long video understanding.

Get this paper in your agent:

hf papers read 2605.31598
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
