Hugging Face Daily Papers · June 24, 2026 · 3 min read

EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.


Authors: Ganlin Yang, Zhangzheng Tu, Yuqiang Yang, Sitong Mao, Junyi Dong, Tianxing Chen, Jiaqi Peng, Jing Xiong, Jiafei Cao, Jifeng Dai, Wengang Zhou, Yao Mu, Tai Wang

Project page: https://ganlin-yang.github.io/EventVLA.github.io/

GitHub: https://github.com/InternRobotics/EventVLA


arxiv:2606.20092


Published on Jun 18 · Submitted by Ganlin Yang on Jun 24 · shanghai ailab



Abstract

EventVLA addresses long-horizon robotic manipulation challenges by introducing a sparse visual evidence memory framework that combines visual anchors with a dynamic Keyframe Evidence Memory module to improve task performance.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
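To make the idea of sparse visual evidence memory concrete, here is a minimal Python sketch of how such a buffer might be organized. It is not the authors' implementation: the class name `SparseEvidenceMemory`, the fixed probability threshold, the buffer sizes, and the eviction rule are all assumptions made for illustration, and the keyframe probabilities are passed in as plain floats rather than predicted from a VLA's latent embeddings.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List, Optional


@dataclass
class SparseEvidenceMemory:
    """Hypothetical memory: visual anchors plus a sparse keyframe buffer."""

    short_term_size: int = 2   # assumed number of recent frames kept as short-term context
    max_keyframes: int = 4     # assumed cap on stored keyframes
    threshold: float = 0.5     # assumed cutoff on the predicted keyframe probability
    initial_frame: Optional[Any] = None
    recent: Deque[Any] = field(default_factory=deque)
    keyframes: List[Any] = field(default_factory=list)

    def update(self, frame: Any, keyframe_prob: float) -> None:
        """Record one observation together with its predicted keyframe probability."""
        # Anchor 1: the very first observation is kept for the whole episode.
        if self.initial_frame is None:
            self.initial_frame = frame
        # Anchor 2: a sliding window over the most recent observations.
        self.recent.append(frame)
        while len(self.recent) > self.short_term_size:
            self.recent.popleft()
        # Keyframe evidence: store the frame only when the predicted probability is high.
        if keyframe_prob >= self.threshold:
            self.keyframes.append(frame)
            if len(self.keyframes) > self.max_keyframes:
                self.keyframes.pop(0)  # assumed policy: evict the oldest keyframe

    def context(self) -> List[Any]:
        """Frames a policy could condition on: initial anchor, keyframes, recent window."""
        if self.initial_frame is None:
            return []
        return [self.initial_frame, *self.keyframes, *self.recent]


# Usage with made-up probabilities; frames are strings standing in for images.
memory = SparseEvidenceMemory()
probs = [0.1, 0.9, 0.2, 0.1, 0.7, 0.05]
for t, p in enumerate(probs):
    memory.update(f"frame_{t}", p)
context = memory.context()
print(context)
```

In this toy run only the two high-probability frames are retained as keyframes, so the stored context stays small regardless of episode length. A frame can appear both as a keyframe and in the recent window; a real system would presumably deduplicate or handle this differently.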


Community

ganlinyang

Paper submitter about 16 hours ago

Hi everyone, please see our latest work, EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies


Get this paper in your agent:

hf papers read 2606.20092

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

