Hugging Face Daily Papers · 3 min read

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.12195

Published on Jun 10 · Submitted by taesiri on Jun 11
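Authors: Ziang Yan, Sheng Xia, Jiashuo Yu, Yue Wu, Tianxiang Jiang, Songze Li, Kanghui Tian, Yicheng Xu, Yinan He, Kai Chen, Limin Wang, Yu Qiao, Yi Wang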

Abstract

InternVideo3 enhances long-horizon multimodal tasks through Multimodal Contextual Reasoning and efficient attention mechanisms, demonstrating strong performance on video understanding benchmarks and video agent capabilities.

Recent progress in foundation models has shifted toward agentic behavior involving multi-step reasoning and tool use. However, open-source efforts largely focus on text-dominant settings, leaving long-horizon multimodal tasks underexplored. This gap is evident in video tasks requiring sustained temporal understanding and iterative interaction. We present InternVideo3, a framework enhancing these capabilities via Multimodal Contextual Reasoning (MCR). MCR treats understanding as a closed-loop process over a shared, evolving context containing observations, instructions, reasoning, tool actions, and memory. This frames long-video understanding as evidence accumulation and verification. To ensure efficiency, we introduce Multimodal Multi-head Latent Attention (M^2LA), a token-preserving reparameterization compressing KV-cache states while retaining the full token stream. Our staged training includes continued pretraining, short-to-long supervised fine-tuning, rule-based reinforcement learning, and on-policy distillation. Experiments show InternVideo3 achieves strong performance on benchmarks like Video-MME, MLVU, and EgoSchema. We further instantiate the model as a video agent with retrieval tools, demonstrating robust evidence-grounded behavior. Our results suggest that efficient context handling and closed-loop reasoning are vital for adapting open multimodal models toward long-horizon visually grounded agency.
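The closed-loop process described in the abstract can be illustrated with a toy sketch. This is not the InternVideo3 implementation: the `Context` class, the `retrieve_clip` tool, the `run_loop` function, and the caption data are all hypothetical, and simple keyword matching stands in for the model's reasoning. It only shows the shape of the idea, in which one shared context accumulates observations, reasoning, tool actions, and memory until the evidence is verified or the step budget runs out.

```python
from dataclasses import dataclass, field

# Hypothetical toy "video": one caption per clip (illustrative data only).
CLIPS = {
    0: "a person enters the kitchen",
    1: "the person chops onions",
    2: "the person puts a pan on the stove",
    3: "the person plates the omelette",
}


@dataclass
class Context:
    """Shared, evolving context with the components named in the abstract."""
    instruction: str
    observations: list = field(default_factory=list)  # evidence gathered so far
    reasoning: list = field(default_factory=list)     # intermediate reasoning notes
    tool_actions: list = field(default_factory=list)  # log of tool calls
    memory: set = field(default_factory=set)          # clip ids already inspected


def retrieve_clip(query_words, ctx):
    """Toy retrieval tool: the unseen clip sharing the most words with the query."""
    best_id, best_score = None, 0
    for clip_id, caption in CLIPS.items():
        if clip_id in ctx.memory:
            continue  # memory prevents re-inspecting the same clip
        score = len(set(caption.split()) & query_words)
        if score > best_score:
            best_id, best_score = clip_id, score
    return best_id


def run_loop(instruction, required_words, max_steps=5):
    """Accumulate evidence until every required word is observed (assumed stop rule)."""
    ctx = Context(instruction=instruction)
    missing = set(required_words)
    for step in range(max_steps):
        if not missing:
            break  # verification passed: all evidence found
        ctx.reasoning.append(f"step {step}: still need evidence for {sorted(missing)}")
        clip_id = retrieve_clip(missing, ctx)
        ctx.tool_actions.append(("retrieve_clip", clip_id))
        if clip_id is None:
            break  # the tool found nothing new; stop instead of guessing
        ctx.memory.add(clip_id)
        caption = CLIPS[clip_id]
        ctx.observations.append((clip_id, caption))
        missing -= set(caption.split())
    return ctx, not missing


if __name__ == "__main__":
    ctx, verified = run_loop(
        "Did the person chop onions and use the stove?", {"onions", "stove"}
    )
    print(verified, ctx.observations)
```

In this sketch the answer is reported as verified only when every required piece of evidence appears in the observations, which loosely mirrors the abstract's framing of long-video understanding as evidence accumulation and verification.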

Community

Paper submitter about 17 hours ago

A framework for long-horizon video understanding via closed-loop contextual reasoning and efficient latent attention.

Github: https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo3


Get this paper in your agent:

hf papers read 2606.12195
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
