Hugging Face Daily Papers · June 10, 2026 · 4 min read

One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2606.10572

Published on Jun 9, 2026 · Submitted by Zhi Zheng on Jun 10, 2026 · National University of Singapore

Authors: Zhi Zheng, Ziqiao Meng, Hao Luan, Wei Liu, Wee Sun Lee

Abstract

Latent Memory introduces a compressed representation approach for external memory in question answering, reducing token consumption and storage requirements while maintaining competitive performance across text-only and multimodal benchmarks.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

External memory effectively grounds question answering (QA) with large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms store each memory item as raw text or images, so retrieval-based systems must pass the retrieved text or images to the generator LLM/VLM. This leads to high token consumption and storage pressure, which makes such systems unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are fed directly as a prompt to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor with reconstruction, contrastive, and distillation objectives in a unified end-to-end manner. Latent Memory is evaluated on seven text-only QA benchmarks (e.g., HotpotQA) and multimodal QA benchmarks, where it achieves QA performance competitive with advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It can also deliver the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
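The retrieval step described in the abstract (embed the query into the latent space, then pick the closest latent tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `cosine`, `retrieve_latents`, and `memory` are hypothetical, the 3-dimensional vectors are toy values standing in for compressor outputs, and cosine similarity is assumed as the scoring function.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_latents(query_vec, memory, k=2):
    """Return the ids of the k memory items whose latent token is closest to the query."""
    ranked = sorted(
        memory,
        key=lambda item_id: cosine(query_vec, memory[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical memory: one latent vector per evidence item, text or image alike.
memory = {
    "text_doc_a": [1.0, 0.0, 0.0],
    "image_b": [0.0, 1.0, 0.0],
    "text_doc_c": [0.7, 0.7, 0.0],
}

# Hypothetical query embedding in the same latent space.
query = [0.9, 0.1, 0.0]

top = retrieve_latents(query, memory, k=2)

# In this sketch the generator prompt would hold one vector per retrieved item,
# followed by the question tokens; no raw text or image is passed along.
prompt_latents = [memory[item_id] for item_id in top]
```

In a real system the vectors would come from the compressor model and the search would likely use an approximate nearest-neighbor index; the point here is only that text and image evidence share one space and each contributes a single prompt position.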

Community

zz1358m

Paper submitter about 14 hours ago

Latent Memory is a novel method for efficient representation and QA generation. It shows that one token per multimodal evidence can lead to a good performance-efficiency trade-off.
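To make the performance-efficiency trade-off concrete, the arithmetic behind "one token per evidence" can be sketched as follows. All numbers are hypothetical and chosen only for illustration; they are not taken from the paper, and `token_budget` is an assumed helper name.

```python
def token_budget(num_items, tokens_per_item, overhead_tokens):
    """Compare generator input sizes for raw-evidence prompting and one-token-per-item prompting.

    overhead_tokens covers the question and any instruction text, which both setups share.
    """
    raw = overhead_tokens + num_items * tokens_per_item
    latent = overhead_tokens + num_items  # each evidence item costs exactly one token
    return raw, latent


# Hypothetical setting: 5 retrieved passages of 80 tokens each, 60 tokens of shared overhead.
raw, latent = token_budget(num_items=5, tokens_per_item=80, overhead_tokens=60)
reduction = raw / latent
print(f"raw={raw} latent={latent} reduction={reduction:.1f}x")
```

The shared overhead bounds the achievable reduction: as the question and instructions grow relative to the evidence, the ratio shrinks, even though the evidence itself is compressed to one token per item.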

Get this paper in your agent:

hf papers read 2606.10572

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
