Hugging Face Daily Papers · 4 min read

Task-Focused Memorization for Multimodal Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Project page: https://taskmem.github.io/
Organization: ByteDance Seed
arxiv:2605.31075

Published on May 29 · Submitted by yichen he on Jun 1
Authors: Tao Zou, Yichen He, Tian Qiu, Yuan Lin, Hang Li

AI-generated summary

A reinforcement-learning-based framework called TaskMem is introduced to dynamically determine what information to store in long-term memory for multimodal agents, improving performance on streaming video benchmarks.

Abstract

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.
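
The streaming evaluation protocol described above (observations arrive in order, tasks arrive online, and each question is answered from memory alone) can be illustrated with a minimal sketch. This is not the authors' code: every name here (`Observation`, `Task`, `run_streaming_eval`, the toy policies and the toy answerer) is hypothetical, text captions stand in for video clips, and exact-match scoring is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Union


@dataclass
class Observation:
    """One chunk of the stream (hypothetical: a caption standing in for a clip)."""
    time: float
    content: str


@dataclass
class Task:
    """A question arriving online, with its reference answer."""
    time: float
    question: str
    answer: str


Event = Union[Observation, Task]
MemorizeFn = Callable[[List[str], Observation], List[str]]
AnswerFn = Callable[[List[str], str], str]


def run_streaming_eval(events: Iterable[Event], memorize: MemorizeFn,
                       answer: AnswerFn) -> float:
    """Replay events in time order and return exact-match accuracy.

    The answerer only ever receives the memory, never the raw observations,
    which mirrors the memory-only constraint in the abstract.
    """
    memory: List[str] = []
    correct = 0
    total = 0
    for event in sorted(events, key=lambda e: e.time):
        if isinstance(event, Observation):
            memory = memorize(memory, event)  # the memorization policy decides
        else:
            correct += int(answer(memory, event.question) == event.answer)
            total += 1
    return correct / total if total else 0.0


# Toy memorization policies (assumptions for illustration only).
def keep_all(memory: List[str], obs: Observation) -> List[str]:
    return memory + [obs.content]


def keep_nothing(memory: List[str], obs: Observation) -> List[str]:
    return memory


def keep_task_relevant(memory: List[str], obs: Observation) -> List[str]:
    # Stand-in for a task-focused policy: retain only content about "keys".
    return memory + [obs.content] if "keys" in obs.content else memory


def toy_answer(memory: List[str], question: str) -> str:
    # Toy answerer: last word of the most recent memory entry about "keys".
    for entry in reversed(memory):
        if "keys" in entry:
            return entry.split()[-1]
    return "unknown"


EVENTS: List[Event] = [
    Observation(0.0, "the keys are on the table"),
    Observation(1.0, "a bird flies past the window"),
    Task(2.0, "where are the keys?", "table"),
    Observation(3.0, "the keys are moved to the drawer"),
    Task(4.0, "where are the keys?", "drawer"),
]
```

In this toy stream, the selective policy answers both questions while storing fewer entries than `keep_all`, which is the kind of trade-off a learned, task-focused memorization policy is meant to exploit. In the paper's setting the policy and the answerer would be an MLLM rather than the keyword rules assumed here.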

Get this paper in your agent:

hf papers read 2605.31075
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
