Hugging Face Daily Papers · 5 min read

Swift Sampling: Selecting Temporal Surprises via Taylor Series

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.22678

Published on May 21, 2026 · Submitted by kim on May 22, 2026
Authors: Dahye Kim, Bhuvan Sachdeva, Karan Uppal, Naman Gupta, Vineeth N. Balasubramanian, Deepti Ghadiyaram
Organization: Boston University
Project page: https://kim-dahye.github.io/swift-sampling/

Abstract

AI-generated summary: Swift Sampling is a training-free frame selection algorithm that identifies high-information video moments by analyzing deviations from predicted visual feature trajectories in latent space.

While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, a training-free frame selection algorithm that automatically identifies high-information moments in a video. Specifically, we model a video as a differentiable trajectory in the visual latent space and compute the velocity and acceleration of its features. We then apply a Taylor expansion to project the expected path of subsequent frames. Frames that diverge sharply from this predicted manifold are identified as temporally surprising and selected for sampling. Unlike prior training-free methods that rely on auxiliary networks or video-specific hyperparameter tuning, Swift Sampling is lightweight: it adds only 0.02x computational cost over the baseline, an overhead 30x lower than that of leading baselines. Across three long-video question answering benchmarks and 10 different downstream tasks, Swift Sampling outperforms uniform sampling and prior query-agnostic baselines. It is especially effective for long videos with limited frame budgets, improving accuracy by up to 12.5 points.
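
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not specify the details, so several things here are assumptions: per-frame features are taken as a precomputed NumPy array of shape (T, D), velocity and acceleration are approximated with backward finite differences at a unit time step, the prediction is a second-order Taylor step (predicted frame = previous frame + velocity + 0.5 * acceleration), surprise is the L2 distance to that prediction, and frames are chosen by plain top-k. The names `taylor_surprise` and `select_frames` are hypothetical.

```python
import numpy as np


def taylor_surprise(features: np.ndarray) -> np.ndarray:
    """Score each frame by its distance from a second-order Taylor prediction.

    features: array of shape (T, D), one feature vector per frame.
    Returns an array of shape (T,). The first three frames get score 0
    because they lack the history needed for the backward differences.
    """
    f = np.asarray(features, dtype=np.float64)
    num_frames = f.shape[0]
    scores = np.zeros(num_frames)
    if num_frames < 4:
        return scores

    prev1 = f[2:-1]   # f[t-1] for t = 3 .. T-1
    prev2 = f[1:-2]   # f[t-2]
    prev3 = f[:-3]    # f[t-3]

    # Backward finite differences evaluated at frame t-1 (assumed scheme).
    velocity = prev1 - prev2
    acceleration = prev1 - 2.0 * prev2 + prev3

    # Second-order Taylor step from frame t-1 to frame t.
    predicted = prev1 + velocity + 0.5 * acceleration

    # Surprise = L2 distance between the actual and the predicted features.
    scores[3:] = np.linalg.norm(f[3:] - predicted, axis=1)
    return scores


def select_frames(features: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` most surprising frames, in time order."""
    scores = taylor_surprise(features)
    k = min(budget, scores.shape[0])
    # Stable sort on negated scores keeps earlier frames first among ties.
    top = np.argsort(-scores, kind="stable")[:k]
    return np.sort(top)
```

Under these assumptions, a feature trajectory that moves at constant velocity scores zero everywhere, while an abrupt jump produces nonzero scores at the jump and at the next two frames, whose finite differences still span the discontinuity. In practice the features would presumably come from a frozen visual encoder, and a real selector might add normalization or temporal spacing, which this sketch omits.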


Get this paper in your agent:

hf papers read 2605.22678
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

