Hugging Face Daily Papers · June 10, 2026 · 3 min read

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Comment from Zhen Yang (YZCS), who submitted the paper:

MemDreamer decouples perception and reasoning in long-video understanding via a Hierarchical Graph Memory and an agentic retrieval mechanism. This paradigm bypasses context limits and mitigates attention dilution, offering a promising scaling direction for future multimodal comprehension.

We warmly welcome feedback, comments, and constructive criticism from the community.

Project page: https://aim-uofa.github.io/MemDreamer/


arxiv:2606.07512


Published on Jun 5 · Submitted by Zhen Yang on Jun 10 · inclusionAI


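Authors: Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen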

Abstract

MemDreamer addresses long-video understanding challenges by decoupling perception and reasoning through hierarchical graph memory and agentic exploration, achieving state-of-the-art performance with reduced computational overhead.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
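To make the abstract's retrieval idea concrete, here is a minimal Python sketch of a tiered graph memory queried through an Observation-Reason-Action loop. It is an illustration only, not the authors' implementation. The tier meanings (video, scene, event), the tool names (`navigate`, `search`, `traverse`), the relation labels (`before`, `caused_by`), and the `scripted_policy` function standing in for the reasoning model are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int   # hypothetical tiers: 0 = whole video, 1 = scene, 2 = event
    text: str   # caption that a perception model would have written
    children: list[str] = field(default_factory=list)


class GraphMemory:
    """Toy hierarchical memory: parent/child tiers plus labeled edges."""

    def __init__(self):
        self.nodes: dict[str, Node] = {}
        self.edges: dict[str, list[tuple[str, str]]] = {}  # src -> [(relation, dst)]

    def add_node(self, node, parent=None):
        self.nodes[node.node_id] = node
        if parent is not None:
            self.nodes[parent].children.append(node.node_id)

    def add_edge(self, src, relation, dst):
        self.edges.setdefault(src, []).append((relation, dst))

    # Tools exposed to the reasoning model (names are assumptions).
    def navigate(self, node_id):
        """Move down the hierarchy: return the ids of a node's children."""
        return list(self.nodes[node_id].children)

    def search(self, keyword):
        """Case-insensitive substring search over node captions."""
        kw = keyword.lower()
        return [n.node_id for n in self.nodes.values() if kw in n.text.lower()]

    def traverse(self, node_id, relation):
        """Follow edges with the given relation label from a node."""
        return [dst for rel, dst in self.edges.get(node_id, []) if rel == relation]


def scripted_policy(question, observations):
    """Stand-in for the reasoning model: chooses the next tool call."""
    if not observations:
        return ("search", "spill")
    last_tool, last_result = observations[-1]
    if last_tool == "search" and last_result:
        return ("traverse", last_result[0], "caused_by")
    if last_tool == "traverse" and last_result:
        return ("answer", last_result[0])
    return ("answer", None)


def run_loop(memory, question, policy, max_steps=5):
    tools = {"navigate": memory.navigate, "search": memory.search,
             "traverse": memory.traverse}
    observations = []
    for _ in range(max_steps):
        tool, *args = policy(question, observations)   # Reason
        if tool == "answer":
            return memory.nodes[args[0]].text if args[0] else None
        result = tools[tool](*args)                    # Action
        observations.append((tool, result))            # Observation
    return None


memory = GraphMemory()
memory.add_node(Node("video", 0, "A cooking vlog"))
memory.add_node(Node("scene1", 1, "Kitchen preparation"), parent="video")
memory.add_node(Node("e1", 2, "The cat jumps onto the counter"), parent="scene1")
memory.add_node(Node("e2", 2, "Milk spills across the counter"), parent="scene1")
memory.add_edge("e1", "before", "e2")
memory.add_edge("e2", "caused_by", "e1")

answer = run_loop(memory, "Why did the milk spill?", scripted_policy)
print(answer)  # The cat jumps onto the counter
```

In this toy setup the reasoning step only ever sees short text captions and node ids, never raw frames, which is the sense in which perception (building the memory) and reasoning (querying it) are kept apart. A real system would replace `scripted_policy` with a model that selects tool calls from the question and the accumulated observations.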