Hugging Face Daily Papers · 4 min read

PEEK: Picking Essential frames via Efficient Knowledge distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Killian Steunou, Anas Filali Razzouki, Khalil Guetari, Mounîm A. El-Yacoubi, Yannis Tevissen
Organization: Moments Lab

- Paper: https://arxiv.org/abs/2605.31029
- Weights: https://huggingface.co/momentslab/peek
- Demo: https://huggingface.co/spaces/momentslab/peek
- Code: https://github.com/momentslab/peek
- Project page: https://www.killian-steunou.com/peek
arxiv:2605.31029

Published on May 29 · Submitted by Killian Steunou on Jun 1

AI-generated summary

PEEK is an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a teacher model into a lightweight temporal model, outperforming state-of-the-art methods in video captioning while maintaining computational efficiency.

Abstract

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
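
To make the idea of distilling frame relevance rankings concrete, here is a minimal sketch of one generic way to train a student to reproduce a teacher's ranking over frames. It uses a ListNet-style listwise loss (cross-entropy between softmax distributions over frames). This is an assumption chosen for illustration: the abstract does not state which loss PEEK uses, and the function names, the temperature parameter, and the score values below are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(xs: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    scaled = [x / temperature for x in xs]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def listwise_distill_loss(
    teacher_scores: Sequence[float],
    student_scores: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """ListNet-style loss: cross-entropy from the teacher's distribution over
    frames to the student's distribution over the same frames.
    Illustrative only; not necessarily the loss used in the paper."""
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must score the same frames")
    p = softmax(teacher_scores, temperature)
    log_q = log_softmax(student_scores, temperature)
    return -sum(pi * lqi for pi, lqi in zip(p, log_q))


# Hypothetical scores for a 3-frame clip (made-up values).
teacher = [2.0, 0.5, -1.0]          # caption-conditioned teacher relevance
student_agrees = [1.5, 0.2, -0.8]   # same ordering as the teacher
student_reversed = [-0.8, 0.2, 1.5] # opposite ordering

print(listwise_distill_loss(teacher, student_agrees))
print(listwise_distill_loss(teacher, student_reversed))
```

A student whose scores order the frames like the teacher gets a lower loss than one that reverses the order, which is the property a ranking distillation objective needs. Only the teacher side would see the caption; the student scores come from visual input alone.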

Community

Paper author and submitter · about 3 hours ago (edited)

PEEK is a query-free frame selector for low-budget video captioning. It learns from a privileged caption-conditioned teacher, but at inference time it receives only video frames: no target caption, no prompt, and no text encoder. Given a budget of k frames, PEEK predicts per-frame relevance scores and returns the selected frames in temporal order, ready to be forwarded to a downstream Video-Language Model.
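
The selection step described above (score every frame, keep the k best, return them in temporal order) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PEEK's actual code: `select_frames`, `uniform_indices`, the tie-breaking rule, and the score values are hypothetical, and the scores would in practice come from the trained selector. A uniform-sampling baseline is included for contrast.

```python
from typing import List, Sequence


def select_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames in temporal order."""
    if k <= 0:
        return []
    k = min(k, len(scores))
    # Rank frames by descending score; ties go to the earlier frame (an assumption).
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Re-sort the chosen indices so the frames are returned in temporal order.
    return sorted(ranked[:k])


def uniform_indices(num_frames: int, k: int) -> List[int]:
    """Content-agnostic baseline: the center frame of k equal-length segments."""
    if k <= 0 or num_frames <= 0:
        return []
    k = min(k, num_frames)
    return [int((j + 0.5) * num_frames / k) for j in range(k)]


# Hypothetical per-frame relevance scores for an 8-frame clip (made-up values).
scores = [0.10, 0.20, 0.85, 0.90, 0.30, 0.05, 0.15, 0.70]

print(select_frames(scores, 3))          # [2, 3, 7]
print(uniform_indices(len(scores), 3))   # [1, 4, 6]
```

The two outputs differ because score-based selection follows the content while uniform sampling only follows the timeline. Sorting the selected indices before returning them keeps the frames in the order a downstream Video-Language Model would expect.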

Get this paper in your agent:

hf papers read 2605.31029
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
