Hugging Face Daily Papers · 6 min read

Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi
Project page: https://aiha-lab.github.io/tangram-page/
arxiv:2606.06302

Published on Jun 15, 2026 · Submitted by Minsoo Kim on Jun 16, 2026

Abstract

Multi-turn large language model serving faces memory constraints due to a growing key-value (KV) cache, but a structured approach to non-uniform compression enables significant throughput improvements through static budget allocation and optimized memory management.
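
To make the memory pressure concrete, here is a minimal back-of-the-envelope sketch in Python. The model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements), the session length, and the number of concurrent sessions are all assumptions chosen for illustration; none of them come from the paper. The only formula used is the standard one: each cached token stores one key vector and one value vector per KV head per layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder shape; illustrative values, not taken from the paper."""
    num_layers: int = 32
    num_kv_heads: int = 8
    head_dim: int = 128
    bytes_per_element: int = 2  # fp16 / bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # The factor 2 accounts for one key and one value vector per head per layer.
    return (2 * shape.num_layers * shape.num_kv_heads
            * shape.head_dim * shape.bytes_per_element)


def kv_cache_bytes(shape: ModelShape, tokens_per_session: int, num_sessions: int) -> int:
    # Uncompressed cache: every session keeps its full dialogue history.
    return kv_bytes_per_token(shape) * tokens_per_session * num_sessions


shape = ModelShape()
per_token = kv_bytes_per_token(shape)
total = kv_cache_bytes(shape, tokens_per_session=32_768, num_sessions=16)
print(f"{per_token / 1024:.0f} KiB per token, {total / 2**30:.0f} GiB for 16 sessions")
# prints: 128 KiB per token, 64 GiB for 16 sessions
```

For scale, a model with about 8 billion 2-byte parameters needs roughly 16 GB for its weights, so under these assumed numbers the uncompressed cache is several times larger than the weights.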

Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads that inflate decode latency by up to 1.7× or burn 15–20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity (an input-invariant head ranking with narrowly bounded per-head ratios) that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to 2.6× over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
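
The interplay of fixed per-head budgets and grouped page tables can be illustrated with a small Python sketch. It is not Tangram's implementation: the function names, the page size of 16 tokens, the per-head ratios, and the simple sort-and-split grouping are all hypothetical. They are chosen only to show why sizing every head to the longest one wastes pages, and why grouping heads with similar budgets recovers most of that waste.

```python
import math
from typing import List, Sequence

PAGE_SIZE = 16  # tokens per KV page; an illustrative value, not from the paper


def pages_needed(tokens: int, page_size: int = PAGE_SIZE) -> int:
    return math.ceil(tokens / page_size)


def head_budgets(context_len: int, ratios: Sequence[float]) -> List[int]:
    """Fix each head's post-compression token budget from a precomputed ratio."""
    return [max(1, int(context_len * r)) for r in ratios]


def uniform_pages(budgets: Sequence[int]) -> int:
    """One shared page table: every head is sized to the longest head."""
    return len(budgets) * pages_needed(max(budgets))


def cluster_heads(budgets: Sequence[int], num_groups: int) -> List[List[int]]:
    """Sort head indices by budget and cut them into contiguous groups."""
    order = sorted(range(len(budgets)), key=lambda h: budgets[h])
    size = math.ceil(len(order) / num_groups)
    return [order[i:i + size] for i in range(0, len(order), size)]


def ragged_pages(budgets: Sequence[int], groups: Sequence[Sequence[int]]) -> int:
    """One page table per group: heads are sized to the longest head in their group."""
    return sum(len(g) * pages_needed(max(budgets[h] for h in g)) for g in groups)


# Hypothetical per-head retention ratios, standing in for offline calibration output.
ratios = [0.05, 0.05, 0.1, 0.1, 0.2, 0.2, 0.5, 0.8]
budgets = head_budgets(context_len=4096, ratios=ratios)
groups = cluster_heads(budgets, num_groups=4)

print("shared table :", uniform_pages(budgets))                  # 1640
print("four groups  :", ragged_pages(budgets, groups))           # 592
print("lower bound  :", sum(pages_needed(b) for b in budgets))   # 515
```

With these made-up ratios, a single shared table needs 1640 pages, four groups need 592, and the per-head lower bound is 515. How Tangram actually forms clusters and manages page tables is specified in the paper and repository, not in this sketch.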

Community

Minsoo Kim (paper author, paper submitter):

Modern serving stacks assume every head holds an identical KV length, so non-uniform compression has stayed a paper-only idea. Non-uniform KV cache compression preserves accuracy far better than uniform schemes in multi-turn scenarios: it gives the heads that actually carry long-range information the budget they need.

  • ✨ Tangram makes it practical for the first time — non-uniform KV cache compression running inside a real serving system (and uniform schemes work just as well)
  • 🔧 Built on vLLM as a drop-in substrate, Tangram supports a wide range of existing KV cache compression algorithms — non-uniform and uniform alike.
  • 📊 And we don't stop at accuracy: we validate real, measured end-to-end throughput gains — up to 2.6× over the full-KV baseline.

Noah (noahml):

Neat paper. Dealing with memory bottlenecks for multi-turn serving is always a headache, and the idea of fixing head-wise retention offline to dodge the fragmentation mess seems like a smart way to make non-uniform compression actually viable in production.

Since you are statically resolving these budgets ahead of time, how robust is the performance if a specific prompt deviates significantly from the 50 calibration samples you used to set the ratios?

I made a podcast on it with ResearchPod, which makes it easy to get the key concepts on the go:
https://researchpod.app/episode/6b552aaf-cc72-4776-a008-68e5696c8500
