Hugging Face Daily Papers · June 9, 2026 · 4 min read

End-to-End Context Compression at Scale

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Introduces Latent Context Language Models (LCLMs), an encoder-decoder framework that compresses long-context prompts into compact latent embeddings, significantly improving efficiency for memory-constrained LLM inference and long-horizon agentic tasks.

Authors: Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, Pavel Izmailov

Code: https://github.com/LeonLixyz/LCLM

arxiv:2606.09659

Published on Jun 8

Submitted by taesiri on Jun 9

Abstract

Architecture search and large-scale pre-training improve encoder-decoder compressors, yielding Latent Context Language Models (LCLMs) that handle long contexts with better task performance, compression speed, and peak memory usage than KV cache compression methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
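To make the memory argument concrete, the sketch below estimates how the decoder's KV cache shrinks when a prompt is replaced by latent embeddings at the three compression ratios named in the abstract. It is a back-of-the-envelope illustration, not the authors' code. The decoder shape in `DECODER` (layer count, KV heads, head dimension) and the fp16 element size are hypothetical values chosen for illustration; the abstract does not give them. The sketch also assumes each latent embedding occupies exactly one decoder position and ignores the encoder's own memory.

```python
import math


def latent_length(n_tokens: int, ratio: int) -> int:
    """Latent positions produced by a 1:ratio compressor, rounded up."""
    return math.ceil(n_tokens / ratio)


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Standard KV cache size for one sequence.

    The factor 2 accounts for one key tensor and one value tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical decoder shape (an assumption, not taken from the paper).
DECODER = dict(n_layers=36, n_kv_heads=8, head_dim=128)


def report(n_tokens: int) -> dict:
    """Map compression ratio -> decoder KV cache bytes (ratio 1 = uncompressed)."""
    sizes = {1: kv_cache_bytes(n_tokens, **DECODER)}
    for ratio in (4, 8, 16):  # ratios named in the abstract
        sizes[ratio] = kv_cache_bytes(latent_length(n_tokens, ratio), **DECODER)
    return sizes


if __name__ == "__main__":
    for ratio, size in report(131072).items():
        print(f"1:{ratio:<2} {size / 2**30:.3f} GiB")
```

Under these assumed dimensions, the cache size scales linearly with the number of positions the decoder sees, so a 1:16 compressor cuts the decoder-side cache by a factor of 16. Real savings depend on the actual model configuration and on the cost of running the encoder.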


Get this paper in your agent:

hf papers read 2606.09659

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
