WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction
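Authors: Chengzhi Liu, Yuzhe Yang, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang
Published: 2026-05-28
Project page: https://worldmemarena-mem.github.io/
Code: https://github.com/UCSB-AI/WorldMemArena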
AI-generated summary
Multimodal large language models require sophisticated memory systems that can track evolving environments and manage information dynamically across multiple sessions, with new benchmarks revealing limitations in current approaches.
Abstract
Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.
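As an illustration of what stage-level diagnosis could look like, the sketch below attributes a failed task to the earliest of the four lifecycle stages named in the abstract (writing, maintenance, retrieval, use). It is a minimal sketch, not the benchmark's actual scoring code: the `MemoryTrace` fields, the `first_failing_stage` function, and the set-containment checks are all assumptions made for illustration, and the paper may define its stage metrics differently.

```python
from dataclasses import dataclass


@dataclass
class MemoryTrace:
    """Hypothetical per-task record; field names are illustrative assumptions."""
    gold_points: set[str]    # ids of gold memory points the task depends on
    gold_updates: set[str]   # ids of gold points that must be revised once stale
    written: set[str]        # ids the system actually stored
    updated: set[str]        # ids the system actually revised
    retrieved: set[str]      # ids the system surfaced at decision time
    answer_correct: bool     # whether the final answer or action was correct


def first_failing_stage(trace: MemoryTrace) -> str:
    """Return the earliest lifecycle stage that fell short, or "ok"."""
    if not trace.gold_points <= trace.written:
        return "writing"      # a needed memory point was never stored
    if not trace.gold_updates <= trace.updated:
        return "maintenance"  # a stale memory point was never revised
    if not trace.gold_points <= trace.retrieved:
        return "retrieval"    # stored and current, but not surfaced
    if not trace.answer_correct:
        return "use"          # evidence was surfaced but not used correctly
    return "ok"


# Everything was stored and updated, but one gold point was not retrieved.
example = MemoryTrace(
    gold_points={"p1", "p2"},
    gold_updates={"p2"},
    written={"p1", "p2", "distractor"},
    updated={"p2"},
    retrieved={"p1"},
    answer_correct=False,
)
print(first_failing_stage(example))
```

Checking stages in lifecycle order is what separates this from a single end-of-task accuracy: the example above would score as one wrong answer under accuracy alone, whereas here it is attributed to retrieval rather than to writing or use.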
Community
WorldMemArena is a new benchmark evaluating the multimodal memory of long-horizon agents using a four-stage Action-World Interaction Loop and multi-session tasks for detailed performance diagnostics.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
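The following papers were recommended by the Semantic Scholar API:
* MemReader: From Passive to Active Extraction for Long-Term Agent Memory (2026): https://huggingface.co/papers/2604.07877
* MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts (2026): https://huggingface.co/papers/2605.20926
* Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions (2026): https://huggingface.co/papers/2605.26256
* MemGym: a Long-Horizon Memory Environment for LLM Agents (2026): https://huggingface.co/papers/2605.20833
* MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents (2026): https://huggingface.co/papers/2605.18652
* When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory (2026): https://huggingface.co/papers/2605.07313
* LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues (2026): https://huggingface.co/papers/2605.12493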
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Cite arxiv.org/abs/2605.29341 in a model, dataset, or Space README.md to link it from this page.