Hugging Face Daily Papers · 3 min read

M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Code: https://github.com/PKU-VaLuE-Lab/m3eval
arxiv:2606.05008

Published on Jun 3, 2026 · Submitted by Huang Jie on Jun 4, 2026
Authors: Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yin Li, Yixin Zhu, Yiwu Zhong

Abstract

Multi-modal models exhibit significant limitations in memory capabilities, particularly in maintaining disentangled representations and demonstrating human-like interference patterns, highlighting the need for improved memory mechanisms in video understanding systems.

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

Get this paper in your agent:

hf papers read 2606.05008
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper 1
