Hugging Face Daily Papers · 3 min read

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.07512

Published on Jun 5, 2026 · Submitted by Zhen Yang on Jun 10, 2026
Authors: Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen
Organization: inclusionAI
Project page: https://aim-uofa.github.io/MemDreamer/

Abstract

MemDreamer addresses long-video understanding challenges by decoupling perception and reasoning through hierarchical graph memory and agentic exploration, achieving state-of-the-art performance with reduced computational overhead.

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
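To make the retrieval idea concrete, here is a minimal sketch of a tiered graph memory with three tools (navigate the hierarchy, search nodes, traverse typed edges) driven by a toy Observation-Reason-Action loop. Everything in it is an assumption for illustration: the class names `Node` and `GraphMemory`, the `caused_by` relation, the keyword search, and the hard-coded policy in `answer_why` are hypothetical and are not taken from the MemDreamer code. In the paper a reasoning model chooses the next tool call; here a fixed rule stands in for that model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One memory entry; tier 0 is the most abstract, tier 2 the most detailed."""
    node_id: str
    tier: int
    text: str
    children: list[str] = field(default_factory=list)          # hierarchy links
    edges: dict[str, list[str]] = field(default_factory=dict)  # relation -> node ids


class GraphMemory:
    """Hypothetical three-tier memory exposing three retrieval tools."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}

    def add(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def expand(self, node_id: str) -> list[str]:
        """Tool 1: navigate the hierarchy one tier down."""
        return list(self.nodes[node_id].children)

    def search(self, keyword: str) -> list[str]:
        """Tool 2: keyword match over node text (stand-in for a real retriever)."""
        kw = keyword.lower()
        return [n.node_id for n in self.nodes.values() if kw in n.text.lower()]

    def traverse(self, node_id: str, relation: str) -> list[str]:
        """Tool 3: follow a typed edge, e.g. a causal link."""
        return list(self.nodes[node_id].edges.get(relation, []))


def answer_why(memory: GraphMemory, keyword: str, max_steps: int = 4) -> list[str]:
    """Toy Observation-Reason-Action loop with a scripted policy.

    Returns the texts of the nodes that were read, i.e. the only
    context a downstream reasoner would need to see.
    """
    context: list[str] = []
    frontier = memory.search(keyword)                  # Action: search
    for _ in range(max_steps):
        if not frontier:
            break
        node_id = frontier.pop(0)
        context.append(memory.nodes[node_id].text)     # Observation
        # Reason: to explain an event, look at what caused it.
        frontier.extend(memory.traverse(node_id, "caused_by"))  # Action: traverse
    return context


memory = GraphMemory()
memory.add(Node("video", 0, "Cooking show episode", children=["scene1"]))
memory.add(Node("scene1", 1, "Chef prepares a sauce", children=["e1", "e2"]))
memory.add(Node("e1", 2, "Chef leaves the pan unattended"))
memory.add(Node("e2", 2, "The sauce burns", edges={"caused_by": ["e1"]}))

print(answer_why(memory, "burns"))
# → ['The sauce burns', 'Chef leaves the pan unattended']
```

The loop reads only two short node texts rather than the whole video, which is the intuition behind keeping the reasoning context to a small fraction of full-context ingestion. The `max_steps` cap is a design choice of this sketch to bound the walk if the graph contains cycles.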

Community

Paper submitter about 11 hours ago

MemDreamer decouples perception and reasoning in long-video understanding via a Hierarchical Graph Memory and an agentic retrieval mechanism. This paradigm bypasses context limits and mitigates attention dilution, offering a promising scaling direction for future multimodal comprehension.

We warmly welcome feedback, comments, and constructive criticism from the community.
