Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving
Authors: Hyungmin Kim, Minsoo Kim, Hongseok Kim, Jungwook Choi
Published: 2026-06-15
arXiv: 2606.06302
Project page: https://aiha-lab.github.io/tangram-page/
Abstract
Multi-turn large language model serving faces memory constraints due to a growing key-value cache, but a structured approach to non-uniform compression enables significant throughput improvements through static budget allocation and optimized memory management.
Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks assume identical KV lengths across heads, so heterogeneity traps freed memory as page fragmentation, spends up to 25% of prefill time reclaiming scattered pages, and skews GPU workloads that inflate decode latency by up to 1.7× or burn 15-20% of each decode step on re-planning. We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity (an input-invariant head ranking with narrowly bounded per-head ratios) that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and Ahead-of-Time Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to 2.6× over the full-KV baseline. Our implementation is publicly available at https://github.com/aiha-lab/TANGRAM.
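To make the Budget Reservation and Ragged Paging ideas from the abstract more concrete, here is a minimal Python sketch of the page-count arithmetic. It is an illustration under stated assumptions, not Tangram's implementation: the page size, the per-head retention ratios, the equal-size grouping rule, and all function names are hypothetical, and the "single table" case models the identical-length assumption simply as padding every head to the longest one.

```python
import math

PAGE_SIZE = 16  # tokens per KV page; an assumed value, not taken from the paper


def reserve_budgets(context_len, head_ratios):
    """Fix each head's post-compression length up front from calibrated ratios.

    head_ratios are hypothetical per-head retention ratios in (0, 1].
    """
    return [max(1, math.ceil(context_len * r)) for r in head_ratios]


def pages_single_table(budgets, page_size=PAGE_SIZE):
    """One shared page table: every head is padded to the longest head."""
    pages_per_head = math.ceil(max(budgets) / page_size)
    return len(budgets) * pages_per_head


def pages_ragged(budgets, num_groups, page_size=PAGE_SIZE):
    """Cluster heads with similar budgets; each group gets its own page table.

    Grouping here is a plain sort followed by equal-size chunks (an assumption).
    Within a group, heads are padded only to that group's longest member.
    """
    ordered = sorted(budgets)
    group_size = math.ceil(len(ordered) / num_groups)
    total = 0
    for start in range(0, len(ordered), group_size):
        group = ordered[start:start + group_size]
        total += len(group) * math.ceil(max(group) / page_size)
    return total


if __name__ == "__main__":
    ratios = [0.05, 0.05, 0.1, 0.1, 0.2, 0.2, 0.5, 0.9]  # made-up example values
    budgets = reserve_budgets(4096, ratios)
    print("per-head budgets:", budgets)
    print("single table pages:", pages_single_table(budgets))
    print("ragged (4 groups) pages:", pages_ragged(budgets, 4))
```

Because the budgets are known before any tokens are written, the page count for each group can be reserved at scheduling time, which is the property the abstract credits with removing page reclamation. More groups track the budgets more tightly at the cost of more page tables to manage.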
Community
Modern serving stacks assume every head holds an identical KV length, so non-uniform compression has stayed a paper-only idea. Yet non-uniform KV cache compression preserves accuracy far better than uniform schemes in multi-turn scenarios: it gives the heads that actually carry long-range information the budget they need.
- ✨ Tangram makes it practical for the first time: non-uniform KV cache compression running inside a real serving system (uniform schemes run on it just as well).
- 🔧 Built on vLLM as a drop-in substrate, Tangram supports a wide range of existing KV cache compression algorithms, non-uniform and uniform alike.
- 📊 And we don't stop at accuracy: we measure end-to-end throughput gains of up to 2.6× over the full-KV baseline.
Neat paper. Dealing with memory bottlenecks for multi-turn serving is always a headache, and the idea of fixing head-wise retention offline to dodge the fragmentation mess seems like a smart way to make non-uniform compression actually viable in production.
Since you are statically resolving these budgets ahead of time, how robust is the performance if a specific prompt deviates significantly from the 50 calibration samples you used to set the ratios?
I made a podcast on it with ResearchPod, which makes it easy to pick up the key concepts on the go:
https://researchpod.app/episode/6b552aaf-cc72-4776-a008-68e5696c8500