MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
Authors: Xiyu Ren, Zhaowei Wang, Yiming Du, Zhongwei Xie, Chi Liu, Xinlin Yang, Haoyue Feng, Wenjun Pan, Tianshi Zheng, Baixuan Xu, Zhengnan Li, Yangqiu Song, Ginny Wong, Simon See
AI-generated summary
A new benchmark evaluates memory capabilities in vision-language models through multi-session conversations, revealing limitations of both long-context and memory-augmented approaches.

Abstract
Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.
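To make the cross-modal token-counting idea concrete, here is a minimal illustrative sketch of how a multimodal conversation's length could be measured by combining text tokens with a fixed per-image token cost. The `TOKENS_PER_IMAGE` constant, the whitespace tokenizer, and the session/turn structure are assumptions for illustration only; the paper defines its own counting scheme.

```python
from typing import Dict, List

TOKENS_PER_IMAGE = 256  # hypothetical fixed cost per image; the paper's scheme may differ

def count_turn_tokens(turn: Dict) -> int:
    """Approximate cross-modal token count for one conversation turn."""
    text_tokens = len(turn.get("text", "").split())  # crude whitespace tokenizer
    image_tokens = len(turn.get("images", [])) * TOKENS_PER_IMAGE
    return text_tokens + image_tokens

def context_length(sessions: List[List[Dict]]) -> int:
    """Total token count across all sessions, used to bucket a
    conversation into a context-length tier (e.g., 32K-256K)."""
    return sum(count_turn_tokens(turn) for session in sessions for turn in session)
```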
Community
We have open-sourced our code and dataset. Please check out our GitHub repository: https://github.com/xrenaf/MEMLENS
We would greatly appreciate it if you could upvote.
This is a timely and valuable benchmark. I really like that MEMLENS focuses on multimodal memory across multi-session conversations, rather than only long text context. The comparison between long-context LVLMs and memory-augmented agents is also very meaningful, and the image-ablation results clearly show that visual evidence is truly necessary.
I’m very interested in this work. One small question: How are the key evidence images distributed across sessions—are they concentrated in a few sessions or intentionally scattered throughout the conversation?
Thank you for your question. We maintain a uniform image-to-text token ratio to prevent images from being overly concentrated in a small number of sessions. More details about this design are provided in our paper. This procedure helps avoid potential shortcuts caused by image concentration and contributes to a more coherent and balanced dataset.
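For readers curious how such a uniform ratio could be enforced in practice, here is a minimal sketch that allocates evidence images in proportion to each session's text tokens. This is an illustrative assumption, not the authors' actual pipeline; the function name and session representation are hypothetical.

```python
from typing import List

def allocate_images(text_tokens_per_session: List[int], n_images: int) -> List[int]:
    """Assign image counts proportional to each session's text tokens,
    keeping the image-to-text ratio roughly uniform across sessions."""
    total_text = sum(text_tokens_per_session)
    quotas = [n_images * t / total_text for t in text_tokens_per_session]
    counts = [int(q) for q in quotas]
    # Hand out the rounding remainder to sessions with the largest fractional parts.
    remainder = n_images - sum(counts)
    by_fraction = sorted(range(len(quotas)),
                         key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in by_fraction[:remainder]:
        counts[i] += 1
    return counts

# Three sessions with 8K/4K/4K text tokens and 8 images to place:
# allocate_images([8000, 4000, 4000], 8) -> [4, 2, 2]
```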