End-to-End Context Compression at Scale
Authors: Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, Pavel Izmailov
Published: 2026-06-08
Code: https://github.com/LeonLixyz/LCLM
Abstract
Architecture search and large-scale pre-training improve encoder-decoder compression, yielding Latent Context Language Models that handle long contexts with better task performance and lower peak memory than KV cache compression methods.
Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
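To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how the decoder's KV cache could shrink at the stated compression ratios. It assumes that a 1:r ratio means one latent embedding per r input tokens and that each latent occupies one decoder position. The decoder shape in `DECODER` (layer count, KV heads, head dimension) and the 16-bit cache precision are hypothetical values chosen for illustration, not figures from the paper. The estimate covers only the cache for the compressed context and ignores encoder cost, model weights, and generated tokens.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def compressed_length(n_tokens, ratio):
    """Number of latent positions for a 1:ratio compressor (ceiling division)."""
    return -(-n_tokens // ratio)


# Hypothetical decoder shape, used only for illustration.
DECODER = dict(n_layers=36, n_kv_heads=8, head_dim=128)


def cache_table(n_tokens, ratios=(1, 4, 8, 16)):
    """Return (ratio, decoder positions, KV cache bytes) for each ratio."""
    rows = []
    for ratio in ratios:
        positions = compressed_length(n_tokens, ratio)
        rows.append((ratio, positions, kv_cache_bytes(positions, **DECODER)))
    return rows


if __name__ == "__main__":
    for ratio, positions, nbytes in cache_table(128_000):
        gib = nbytes / 2**30
        print(f"1:{ratio:<2d} positions={positions:>7d} kv_cache={gib:.2f} GiB")
```

Because the cache size is linear in the number of decoder positions, the estimated cache for the context shrinks by the same factor as the compression ratio under these assumptions. Actual end-to-end savings depend on details the abstract does not specify.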
Community
Introduces Latent Context Language Models (LCLMs), an encoder-decoder framework that compresses long-context prompts into compact latent embeddings, significantly improving efficiency for memory-constrained LLM inference and long-horizon agentic tasks.
Cite arxiv.org/abs/2606.09659 in a model, dataset, or Space README.md to link it from this page.