Hugging Face Daily Papers · 4 min read

Kwai Keye-VL-2.0 Technical Report

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.10651

Published on Jun 9 · Submitted by Tianming Liang on Jun 10 · #2 Paper of the day
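Authors: Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang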

Abstract

Kwai Keye-VL-2.0-30B-A3B is an open-source Mixture-of-Experts multimodal foundation model that enables long-video understanding and agentic intelligence through DeepSeek Sparse Attention and specialized training infrastructure.

We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. To address the challenges of ultra-long contexts, information redundancy, and prohibitive computational costs inherent in hour-level videos, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. This architecture is underpinned by a highly optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that significantly maximize throughput and minimize computational overhead. Furthermore, to overcome the algorithmic dilemma of catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD) paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively empowers advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks demonstrate that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, particularly excelling in fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
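
The abstract does not describe how DSA picks which tokens to attend to, so the following is only a rough illustration of the general idea behind top-k sparse attention with grouped queries, not the paper's method or its custom kernels. It assumes a precomputed relevance matrix, here called `index_scores` (a hypothetical stand-in for whatever scorer the model uses), and lets each query attend only to its `top_k` best causal keys. Several query heads share one key/value head, as in GQA. All names and shapes are illustrative assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a non-empty list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topk_sparse_gqa_attention(q, k, v, index_scores, top_k):
    """Causal attention restricted to the top_k highest-scoring keys per query.

    q: [n_q_heads][T][d] query heads
    k, v: [n_kv_heads][T][d] key/value heads shared by groups of query heads
    index_scores: [T][T] hypothetical relevance of key s for query t
    """
    n_q_heads, n_kv_heads = len(q), len(k)
    assert n_q_heads % n_kv_heads == 0 and top_k >= 1
    group = n_q_heads // n_kv_heads
    T, d = len(q[0]), len(q[0][0])

    # One key selection per query position, shared by every head.
    selected = []
    for t in range(T):
        causal_keys = range(t + 1)  # query t may only see keys 0..t
        ranked = sorted(causal_keys, key=lambda s: index_scores[t][s], reverse=True)
        selected.append(sorted(ranked[:top_k]))

    out = []
    for h in range(n_q_heads):
        kv = h // group  # GQA: map each query head to its shared KV head
        head_out = []
        for t in range(T):
            keys = selected[t]
            weights = softmax([dot(q[h][t], k[kv][s]) / math.sqrt(d) for s in keys])
            head_out.append(
                [sum(w * v[kv][s][j] for w, s in zip(weights, keys)) for j in range(d)]
            )
        out.append(head_out)
    return out
```

In this sketch the attention cost per query scales with `top_k` rather than with the full sequence length, which is the property that makes sparse selection attractive for very long contexts. When `top_k` is at least the sequence length, the function reduces to ordinary dense causal attention and the scores no longer matter.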

Community

Paper submitter about 9 hours ago

A good paper!

Get this paper in your agent:

hf papers read 2606.10651
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Spaces citing this paper 1
