Kwai Keye-VL-2.0 Technical Report
Published: 2026-06-09
Project page: https://kwai-keye.github.io/
Code: https://github.com/Kwai-Keye/Keye
Authors: Kwai Keye Team, Bin Wen, Changyi Liu, Chengru Song, Chongling Rao, Guowang Zhang, Han Li, Haonan Fan, Hengrui Ju, Jiankang Chen, Jiapeng Chen, Jiawei Yuan, Kaixuan Yang, Kaiyu Jiang, Kun Gai, Lingzhi Zhou, Na Nie, Sen Na, Tianke Zhang, Tingting Gao, Xuanyu Zheng, Yulong Chen, Fan Yang, Haixuan Gao, Lele Yang, Mingqiao Liu, Muxi Diao, Qi Zhang, Qile Su, Wei Chen, Wentao Hong, Xingyu Lu, Yancheng Long, Yankai Yang, Yingxin Li, Yiyang Fan, Yu Xia, Yuzhe Chen, Ziliang Lai, Chuan Yi, Haonan Jia, Tianming Liang, Weixin Xu, Xiaoxiao Ma, Yang Tian, Yufei Han, Feng Han, Hang Li, Jing Wang, Jinghui Jia, Junmin Chen, Junyu Shi, Ruilin Zhang
Abstract
AI-generated summary: Kwai Keye-VL-2.0-30B-A3B is an open-source Mixture-of-Experts multimodal foundation model that enables long-video understanding and agentic intelligence through DeepSeek Sparse Attention and specialized training infrastructure.
We introduce Kwai Keye-VL-2.0-30B-A3B, an open-source Mixture-of-Experts (MoE) multimodal foundation model designed to advance long-video understanding and agentic intelligence. Hour-level videos bring ultra-long contexts, information redundancy, and prohibitive computational costs. To address these challenges, Keye-VL-2.0 is the first to adapt DeepSeek Sparse Attention (DSA) to GQA-based multimodal architectures, enabling lossless 256K context processing while capturing critical frames and long-range temporal dependencies. The architecture is supported by an optimized training and inference infrastructure, including scalable video I/O, heterogeneous ViT-LM parallelism, and custom DSA kernels that increase throughput and reduce computational overhead. To mitigate catastrophic forgetting during multi-task alignment, we introduce Cross-Modal Multi-Teacher On-Policy Distillation (MOPD), paired with Context-RL and Video-RL. By distilling dense token-level teacher feedback from on-policy rollouts back into the MoE backbone, which activates only 3B parameters, Keye-VL-2.0 natively supports advanced agent collaboration across Code, Tool, and Search scenarios with multimodal self-correction. Extensive evaluations across video understanding, temporal grounding, reasoning, STEM, and agent benchmarks show that Keye-VL-2.0-30B-A3B achieves state-of-the-art performance among models of similar scale, excelling in particular at fine-grained temporal localization on TimeLens and long-video comprehension on Video-MME-v2 and LongVideoBench. We release our model checkpoints to accelerate community progress toward scalable and robust multimodal agentic applications.
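The abstract does not spell out how sparse attention is combined with grouped-query attention (GQA), so the following is only a minimal sketch of the general idea, not the Keye-VL-2.0 implementation. It assumes a cheap per-position relevance score is already available (the `index_scores` argument is a hypothetical stand-in for whatever indexer the model uses), keeps the top-k cached positions, and lets each group of query heads attend to its shared key/value head over those positions only. All function names here are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_indices(scores, k):
    """Indices of the k largest scores, returned in ascending position order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def sparse_gqa_attention(queries, keys, values, index_scores, k, group_size):
    """Attend from one query token to the top-k cached positions.

    queries:      [num_q_heads][d]       one vector per query head
    keys:         [num_kv_heads][T][d]   cached keys per KV head
    values:       [num_kv_heads][T][d_v] cached values per KV head
    index_scores: [T] relevance score per cached position (assumed given;
                  hypothetical stand-in for a learned indexer)
    group_size:   number of query heads sharing one KV head (GQA)
    """
    selected = topk_indices(index_scores, min(k, len(index_scores)))
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for head, q in enumerate(queries):
        kv = head // group_size  # consecutive query heads share a KV head
        logits = [dot(q, keys[kv][t]) * scale for t in selected]
        weights = softmax(logits)
        d_v = len(values[kv][0])
        out = [
            sum(w * values[kv][t][j] for w, t in zip(weights, selected))
            for j in range(d_v)
        ]
        outputs.append(out)
    return selected, outputs
```

In this sketch the attention cost per query scales with k rather than with the full context length T, which is the property that makes very long video contexts tractable. Setting k equal to T recovers ordinary dense GQA attention.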
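The phrase "dense token-level teacher feedback from on-policy rollouts" can be made concrete with a small sketch. The abstract does not state which divergence MOPD uses or how the teachers are combined, so the code below makes two loudly labeled assumptions: the per-token signal is the reverse KL divergence KL(student || teacher), a common choice in on-policy distillation, and the teacher logits for a rollout come from a single teacher already chosen for that task. The function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary at one token position.

    Assumption: reverse KL is used as the per-token distillation signal.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def on_policy_distill_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL over one rollout.

    student_steps: [num_tokens][vocab] student logits at each position of a
                   sequence that the student itself sampled (on-policy)
    teacher_steps: [num_tokens][vocab] teacher logits scored on that same
                   student-generated sequence
    Returns (mean_loss, per_token_losses).
    """
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token), per_token
```

Because the teacher scores every position of the student's own rollout, the student receives one feedback value per token rather than a single reward per sequence, which is what "dense" refers to here.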
arXiv: arxiv.org/abs/2606.10651