Hugging Face Daily Papers · 4 min read

End-to-End Context Compression at Scale

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Introduces Latent Context Language Models (LCLMs), an encoder-decoder framework that compresses long-context prompts into compact latent embeddings, significantly improving efficiency for memory-constrained LLM inference and long-horizon agentic tasks.

Code: https://github.com/LeonLixyz/LCLM
arxiv:2606.09659

Published on Jun 8 · Submitted by taesiri on Jun 9
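Authors: Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, Pavel Izmailov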

Abstract

Architecture search and large-scale continual pre-training are used to improve encoder-decoder context compression, producing Latent Context Language Models (LCLMs) that handle long contexts with a better trade-off between task performance, compression speed, and peak memory than KV cache compression methods.

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
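The memory argument in the abstract can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the decoder shape (36 layers, 8 KV heads, head dimension 128, 16-bit cache values) is a hypothetical configuration chosen for the arithmetic, not the architecture reported by the authors. It also assumes the decoder caches one key/value entry per latent embedding, so a 1:r compression ratio shortens the cached sequence by a factor of r.

```python
import math


def kv_cache_bytes(seq_len, num_layers=36, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    All default sizes are hypothetical, picked only to make the
    arithmetic concrete. The leading factor of 2 counts keys plus values.
    """
    per_position = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_position * seq_len


def latent_length(num_tokens, ratio):
    """Number of latent embeddings at a 1:ratio compression ratio."""
    return math.ceil(num_tokens / ratio)


def compare(num_tokens, ratios=(4, 8, 16)):
    """Return (label, cached positions, cache bytes) for each setting."""
    rows = [("uncompressed", num_tokens, kv_cache_bytes(num_tokens))]
    for r in ratios:
        n = latent_length(num_tokens, r)
        rows.append((f"1:{r}", n, kv_cache_bytes(n)))
    return rows


if __name__ == "__main__":
    # A 128K-token prompt under the assumed decoder shape.
    for label, length, size in compare(131072):
        print(f"{label:>12}  {length:>7} positions  {size / 2**30:6.3f} GiB")
```

Under these assumptions the cache shrinks linearly with the compression ratio. The estimate covers only the decoder-side cache: it leaves out model weights, activations, and whatever memory the encoder needs while compressing the prompt, all of which a real peak-memory measurement would include.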



Get this paper in your agent:

hf papers read 2606.09659
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
