Hugging Face Daily Papers · May 29, 2026 · 6 min read

WorldMemArena: Evaluating Multimodal Agent Memory Through Action-World Interaction

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Chengzhi Liu, Yuzhe Yang, Sophia Xiao Pu, Yepeng Liu, Lin Long, Yichen Guo, Nuo Chen, Zhaotian Weng, Elena Kochkina, Simerjot Kaur, Charese Smiley, Xiaomo Liu, James Zou, Sheng Liu, Yuheng Bu, Songyou Peng, Xin Eric Wang

arxiv:2605.29341

Published on May 28, 2026 · Submitted by taesiri on May 29, 2026

AI-generated summary

Multimodal large language models require sophisticated memory systems that can track evolving environments and manage information dynamically across multiple sessions, with new benchmarks revealing limitations in current approaches.

Abstract

Multimodal large language models are increasingly deployed as long-horizon agents, where memory must do more than recall: it must track an evolving world, revise what has gone stale, and surface the right evidence at decision time. Existing benchmarks measure recall over static dialogue, collapse memory into a single end-of-task accuracy, and reduce visual observations to captions, leaving us unable to localize failures to writing, maintenance, retrieval, or use. The rise of agent harnesses that author their own memory sharpens this gap, since we have no principled way to compare hand-designed pipelines with self-managing alternatives. To close these gaps, we formulate multimodal agent memory as an Action-World Interaction Loop with an observable four-stage lifecycle, and instantiate it in WorldMemArena: 400 multi-session multimodal tasks spanning Lifelong Evolution (evolving personal and task states) and Agentic Execution (memory from real observations, actions, and feedback), annotated with gold memory points, updates, distractors, and evidence chains for stage-level diagnosis. This enables the first head-to-head comparison of long-context, manually designed (RAG and external memory systems), and harness-based memory agents. Results show that: (1) better memory writing and storage do not guarantee better performance; (2) multimodal memory still struggles to fully use visual evidence; (3) systems are unstable across domains and degrade on realistic agentic trajectories; and (4) harness memory is more flexible but remains costly and less reliable.
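To make the idea of stage-level diagnosis concrete, here is a minimal sketch of how gold annotations could be used to score writing, maintenance, retrieval, and use separately. It is an illustration only, not the benchmark's actual schema or metrics: the class names, field names, exact-match scoring, and the example task are all assumptions introduced here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GoldAnnotation:
    """Hypothetical per-task gold labels (field names are assumptions)."""
    memory_points: set[str]   # facts the agent should have written to memory
    stale_points: set[str]    # facts superseded by a later update
    evidence: set[str]        # facts needed to answer the final query
    answer: str


@dataclass
class SystemTrace:
    """Hypothetical log of what a memory system did on one task."""
    written: set[str]         # everything ever written to memory
    store_at_query: set[str]  # memory contents when the query arrives
    retrieved: set[str]       # entries surfaced for the query
    answer: str


def _ratio(hits: int, total: int) -> float:
    # An empty gold set counts as trivially satisfied.
    return hits / total if total else 1.0


def stage_scores(gold: GoldAnnotation, trace: SystemTrace) -> dict[str, float]:
    """Score each lifecycle stage on its own instead of one final accuracy."""
    stale_left = len(gold.stale_points & trace.store_at_query)
    return {
        # Writing: share of gold memory points that were ever stored.
        "writing": _ratio(len(gold.memory_points & trace.written),
                          len(gold.memory_points)),
        # Maintenance: share of stale facts no longer in the store.
        "maintenance": _ratio(len(gold.stale_points) - stale_left,
                              len(gold.stale_points)),
        # Retrieval: share of required evidence that was surfaced.
        "retrieval": _ratio(len(gold.evidence & trace.retrieved),
                            len(gold.evidence)),
        # Use: exact match on the normalized final answer.
        "use": float(trace.answer.strip().lower() == gold.answer.strip().lower()),
    }


# Made-up task: the mug was red in session 1 and replaced by a blue one later.
gold = GoldAnnotation(
    memory_points={"mug is red", "mug is blue", "keys in drawer"},
    stale_points={"mug is red"},
    evidence={"mug is blue"},
    answer="blue",
)
trace = SystemTrace(
    written={"mug is red", "mug is blue", "keys in drawer"},
    store_at_query={"mug is red", "mug is blue", "keys in drawer"},
    retrieved={"mug is red"},
    answer="red",
)
scores = stage_scores(gold, trace)
print(scores)
```

In this made-up trace the system writes every gold memory point (writing score 1.0) yet never retires the stale fact, retrieves it instead of the updated one, and answers incorrectly. A single end-of-task accuracy would report only the final failure; the per-stage breakdown shows where it originated, which is the kind of localization the abstract argues for.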

arXiv: https://arxiv.org/abs/2605.29341 · Project page: https://worldmemarena-mem.github.io/ · GitHub: https://github.com/UCSB-AI/WorldMemArena

Community

taesiri · Paper submitter · 1 day ago

WorldMemArena is a new benchmark evaluating the multimodal memory of long-horizon agents using a four-stage Action-World Interaction Loop and multi-session tasks for detailed performance diagnostics.

librarian-bot · about 13 hours ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

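The following papers were recommended by the Semantic Scholar API:

- MemReader: From Passive to Active Extraction for Long-Term Agent Memory (2026), https://huggingface.co/papers/2604.07877
- MemConflict: Evaluating Long-Term Memory Systems Under Memory Conflicts (2026), https://huggingface.co/papers/2605.20926
- Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions (2026), https://huggingface.co/papers/2605.26256
- MemGym: a Long-Horizon Memory Environment for LLM Agents (2026), https://huggingface.co/papers/2605.20833
- MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents (2026), https://huggingface.co/papers/2605.18652
- When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory (2026), https://huggingface.co/papers/2605.07313
- LongMemEval-V2: Evaluating Long-Term Agent Memory Toward Experienced Colleagues (2026), https://huggingface.co/papers/2605.12493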

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space: https://huggingface.co/spaces/librarian-bots/recommend_similar_papers

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Get this paper in your agent:

hf papers read 2605.29341

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
