Hugging Face Daily Papers · 4 min read

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Code and dataset: https://github.com/amy-hyunji/MINTEval

arXiv: 2605.18565

Published on May 18, 2026 · Submitted by Singh (joykirat) on May 21, 2026
Authors: Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

AI-generated summary

Existing memory-augmented agents struggle with long-horizon, interference-heavy settings that require accurate recall and aggregated reasoning across evolving information.

Abstract

Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories.

In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce LongMINT (Long-Horizon Memory under INTerference), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information.

Overall, LongMINT has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction.

Furthermore, current memory systems struggle to recall and reason over earlier facts that are later revised or interfered with by subsequent context, with performance degrading as the number of intervening updates increases.
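To make the two question types in the abstract concrete, here is a minimal Python sketch over a toy state-tracking history. Everything in it (the `Update` record, the `HISTORY` data, and the helper functions) is hypothetical and invented for illustration; it is not the LongMINT data format or the authors' evaluation code. It only shows why later revisions can interfere with recall of an earlier fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """One hypothetical memory event: at `step`, `key` is set to `value`."""
    step: int
    key: str
    value: int


# Toy history, invented for illustration (not LongMINT data).
HISTORY = [
    Update(0, "apples", 3),
    Update(1, "pears", 5),
    Update(2, "apples", 7),  # revises the earlier "apples" fact
    Update(3, "plums", 2),
    Update(4, "pears", 1),   # revises the earlier "pears" fact
]


def recall_latest(history, key):
    """Single-target recall: most recent value written for `key` (None if absent)."""
    latest = None
    for event in history:
        if event.key == key:
            latest = event.value
    return latest


def recall_at(history, key, step):
    """Recall the value of `key` as of `step`, ignoring later revisions."""
    return recall_latest([e for e in history if e.step <= step], key)


def aggregate_latest(history, keys):
    """Multi-target aggregation: sum the current values of several known keys."""
    return sum(recall_latest(history, key) for key in keys)


def updates_after(history, step):
    """Count the events recorded after `step`, i.e. the intervening updates."""
    return sum(1 for e in history if e.step > step)
```

In this toy setting, `recall_latest(HISTORY, "apples")` returns the revised value 7, while `recall_at(HISTORY, "apples", 1)` returns the earlier value 3, which a system can only answer if it has kept the superseded fact. `aggregate_latest(HISTORY, ["apples", "pears", "plums"])` needs all three current values to be right at once, so a single stale retrieval is enough to make the aggregate wrong.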


Get this paper in your agent:

hf papers read 2605.18565
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

