Hugging Face Daily Papers · 4 min read

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.07433

Published on Jun 5 · Submitted by Jiahao Meng on Jun 8
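Authors: Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang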

Abstract

Multimodal large language models for video understanding are structured around three core capabilities—watching, remembering, and reasoning—with applications spanning multiple video domains and addressing challenges in perception, memory, and reasoning.

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.
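The abstract characterizes a video understanding system by four components: perceptual representations, memory states, reasoning traces, and final predictions. The toy sketch below is one way to picture how those pieces could fit together. It is not the paper's formulation: every name in it (`Percept`, `Memory`, `watch`, `remember`, `reason`) is hypothetical, frame captions stand in for real perceptual features, and keyword matching stands in for model reasoning.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Percept:
    """Hypothetical perceptual representation: one captioned frame."""
    timestamp: float
    caption: str


@dataclass
class Memory:
    """Hypothetical memory state: a bounded first-in, first-out store."""
    capacity: int
    items: List[Percept] = field(default_factory=list)

    def update(self, percept: Percept) -> None:
        # Append the new percept, then evict the oldest one if over budget.
        self.items.append(percept)
        if len(self.items) > self.capacity:
            self.items.pop(0)


def watch(frames: List[Tuple[float, str]]) -> List[Percept]:
    """Watching: turn raw (timestamp, caption) pairs into percepts."""
    return [Percept(t, c) for t, c in frames]


def remember(percepts: List[Percept], capacity: int) -> Memory:
    """Remembering: stream percepts into a memory with a fixed budget."""
    memory = Memory(capacity)
    for p in percepts:
        memory.update(p)
    return memory


def reason(memory: Memory, query: str) -> Tuple[List[str], List[float]]:
    """Reasoning: return (trace, prediction) grounded in stored percepts.

    The prediction is the list of timestamps whose caption contains the
    query keyword; the trace records which evidence supported each one.
    """
    trace: List[str] = []
    prediction: List[float] = []
    for p in memory.items:
        if query.lower() in p.caption.lower():
            trace.append(f"t={p.timestamp:.1f}s: '{p.caption}' matches '{query}'")
            prediction.append(p.timestamp)
    return trace, prediction


if __name__ == "__main__":
    frames = [
        (0.0, "a person enters the kitchen"),
        (5.0, "the person opens the fridge"),
        (10.0, "the person pours milk"),
    ]
    memory = remember(watch(frames), capacity=2)
    print(reason(memory, "fridge"))
    print(reason(memory, "kitchen"))
```

With a capacity of 2, the earliest frame is evicted before the query arrives, so the "kitchen" query finds no evidence while the "fridge" query is still answerable. This mirrors, in miniature, the tension the abstract describes between sparse evidence, long-range dependencies, and limited computational budgets.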


Get this paper in your agent:

hf papers read 2606.07433
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
