Hugging Face Daily Papers · June 16, 2026 · 3 min read

MVEB: Massive Video Embedding Benchmark

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arXiv:2606.14958 · Published on Jun 12, 2026 · Submitted by Kenneth C. Enevoldsen (Massive Text Embedding Benchmark) on Jun 16, 2026

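Authors: Adnan El Assadi, Roman Solomatin, Isaac Chung, Chenghao Xiao, Deep Shah, Manan Dey, Shriya Sudhakar, Zacharie Bugaud, Wissam Siblini, Ayush Sunil Munot, Yashwanth Devavarapu, Rakshitha Ireddi, Michelle Yang, Márton Kardos, Niklas Muennighoff, Kenneth Enevoldsen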

Abstract

AI-generated summary (Qwen/Qwen2.5-Coder-32B-Instruct): A large-scale video embedding benchmark evaluates diverse models across multiple video understanding tasks, revealing that different model architectures excel in specific domains and demonstrating the nuanced impact of audio on performance based on dataset characteristics.

We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding leads on retrieval and zero-shot classification; generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
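The paired video-only vs. audio+video comparison described in the abstract can be sketched as a per-task score difference, averaged within each annotation-provenance group. The snippet below is a minimal illustration only, not the paper's analysis code: the task names, the provenance labels, the field names, and every score are hypothetical placeholders chosen for easy arithmetic, and the functions `audio_delta_by_provenance` and `provenance_gap` are invented for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy scores (NOT results from the paper). One row per task.
# "provenance" records which modalities the annotators used to produce labels.
TOY_RESULTS = [
    {"task": "toy_task_a", "provenance": "audio+video", "video_only": 60.0, "audio_video": 63.0},
    {"task": "toy_task_b", "provenance": "audio+video", "video_only": 50.0, "audio_video": 51.0},
    {"task": "toy_task_c", "provenance": "video", "video_only": 70.0, "audio_video": 69.0},
    {"task": "toy_task_d", "provenance": "video", "video_only": 40.0, "audio_video": 39.0},
]


def audio_delta_by_provenance(rows):
    """Mean (audio+video score minus video-only score) per provenance group."""
    deltas = defaultdict(list)
    for row in rows:
        deltas[row["provenance"]].append(row["audio_video"] - row["video_only"])
    return {group: mean(values) for group, values in deltas.items()}


def provenance_gap(rows, both="audio+video", visual="video"):
    """Difference between the mean audio effect of the two provenance groups."""
    by_group = audio_delta_by_provenance(rows)
    return by_group[both] - by_group[visual]


if __name__ == "__main__":
    print(audio_delta_by_provenance(TOY_RESULTS))
    print(provenance_gap(TOY_RESULTS))
```

A positive mean delta for a group means adding audio helped on those tasks, and a negative one means it hurt. Whether the paper's reported gap is computed in exactly this way is an assumption of the sketch.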

arXiv page: https://arxiv.org/abs/2606.14958 · Project page: https://embeddings-benchmark.github.io/leaderboard-frontend/benchmark/MVEB(beta)

Community

KennethEnevoldsen (paper author, paper submitter) · Jun 16, 2026 (edited)

Code for running the benchmark can be found in mteb (https://github.com/embeddings-benchmark/mteb), while scripts for reproducing paper artifacts will be made available at mveb-paper (https://github.com/embeddings-benchmark/mveb-paper) once the paper has been reviewed and finalized.
