Hugging Face Daily Papers · June 11, 2026 · 3 min read

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

#model-release #multimodal #agents #reasoning #benchmark

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

A framework for long-horizon video understanding via closed-loop contextual reasoning and efficient latent attention.

GitHub: https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo3

arXiv:2606.12195

Published on Jun 10 · Submitted by taesiri on Jun 11

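Authors: Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Songze Li, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Limin Wang, Yu Qiao, Yi Wang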

Abstract

InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
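
To make the closed-loop idea more concrete, the following is a minimal sketch of evidence accumulation and verification over a shared context. It is not InternVideo3's code or API: `Context`, `retrieve_clips`, `verify_pickup`, `run_loop`, and `TOY_INDEX` are hypothetical names invented for illustration, the captions are made up, and the fixed list of queries stands in for whatever the model would decide to do at each step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Context:
    """Hypothetical shared context; fields mirror the abstract's list."""
    instruction: str
    observations: List[str] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)
    tool_actions: List[str] = field(default_factory=list)
    memory: Dict[int, str] = field(default_factory=dict)  # timestamp -> caption


# Toy stand-in for a video: timestamp in seconds -> made-up clip caption.
TOY_INDEX: Dict[int, str] = {
    5: "a person enters the kitchen",
    42: "the person opens a cabinet",
    63: "the person picks up a cup",
    90: "the person pours water into the cup",
}


def retrieve_clips(index: Dict[int, str], query: str) -> List[Tuple[int, str]]:
    """Hypothetical retrieval tool: keyword match over clip captions."""
    q = query.lower()
    return [(t, cap) for t, cap in sorted(index.items()) if q in cap.lower()]


def verify_pickup(ctx: Context) -> Optional[str]:
    """Toy verifier: accept the earliest remembered clip showing the pickup."""
    for t in sorted(ctx.memory):
        cap = ctx.memory[t].lower()
        if "picks up" in cap and "cup" in cap:
            return f"{t}s"
    return None


def run_loop(
    ctx: Context,
    index: Dict[int, str],
    plan: List[str],
    verify: Callable[[Context], Optional[str]],
    max_steps: int = 4,
) -> Optional[str]:
    """Act, observe, update the context, verify; stop once evidence suffices."""
    for step, query in enumerate(plan[:max_steps]):
        # Tool action is recorded in the context before it is executed.
        ctx.tool_actions.append(f"retrieve_clips(query={query!r})")
        hits = retrieve_clips(index, query)
        for t, cap in hits:
            ctx.observations.append(f"t={t}s: {cap}")
            ctx.memory[t] = cap  # accumulate evidence across steps
        answer = verify(ctx)
        ctx.reasoning.append(
            f"step {step}: {len(hits)} hit(s); answer={answer or 'unverified'}"
        )
        if answer is not None:
            return answer
    return None  # evidence never passed verification


if __name__ == "__main__":
    ctx = Context(instruction="When does the person pick up the cup?")
    print(run_loop(ctx, TOY_INDEX, ["kitchen", "cup"], verify_pickup))
    print(ctx.reasoning)
```

In this toy run the first query returns a clip that does not answer the question, so the loop continues; the second query adds the clip that passes verification and the loop stops. The point of the sketch is only the control flow: every step reads from and writes to the same context, and the answer is accepted only after a check against accumulated evidence.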