Hugging Face Daily Papers · May 22, 2026 · 4 min read

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

Authors: Zedong Liu, Xinyang Ma, Dejun Luo, Hairui Zhao, Bing Lu, Wenjing Huang, Yida Gu, Xingchen Liu, Zheng Wei, Jinyang Liu, Dingwen Tao, Guangming Tan

Code: https://github.com/hpdps-group/KVServe

arxiv:2605.13734

Published on May 13 · Submitted by Zedong Liu on May 22 · Institute of Computing Technology, Chinese Academy of Sciences

Abstract

AI-generated summary: KVServe is a service-aware, adaptive framework for KV cache communication compression in disaggregated large language model serving. It selects compression strategies dynamically and reports significant improvements in job completion time and time to first token.

LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., prefill-decode (PD) separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload crossing network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression methods are typically applied as static runtime configurations, even though production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency.

We present KVServe, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving. KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by 50×; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs, and networks, KVServe achieves up to 9.13× job completion time (JCT) speedup in PD-separated serving and up to 32.8× time-to-first-token (TTFT) reduction in KV-disaggregated serving.
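The claim that a fixed compression choice can increase latency can be illustrated with a small Python sketch. This is not KVServe's latency model, which this page does not describe. It assumes transfer latency is codec time plus wire time, and the `Profile` fields, profile names, and all numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Hypothetical compression profile; every field value is illustrative."""
    name: str
    ratio: float         # original size / compressed size
    codec_gbs: float     # combined compress + decompress throughput, GB/s
    quality_loss: float  # assumed accuracy drop, percentage points


def transfer_latency_s(kv_gb: float, bandwidth_gbs: float, p: Profile) -> float:
    """Assumed model: time spent in the codec plus time on the wire."""
    codec_time = kv_gb / p.codec_gbs            # 0.0 when codec_gbs is inf
    wire_time = (kv_gb / p.ratio) / bandwidth_gbs
    return codec_time + wire_time


def select(profiles, kv_gb, bandwidth_gbs, max_quality_loss):
    """Pick the lowest-latency profile that fits the quality budget."""
    feasible = [p for p in profiles if p.quality_loss <= max_quality_loss]
    return min(feasible, key=lambda p: transfer_latency_s(kv_gb, bandwidth_gbs, p))


PROFILES = [
    Profile("none", ratio=1.0, codec_gbs=math.inf, quality_loss=0.0),
    Profile("light", ratio=2.0, codec_gbs=20.0, quality_loss=0.1),
    Profile("heavy", ratio=8.0, codec_gbs=2.0, quality_loss=1.0),
]

# Same 4 GB KV payload, three different link speeds (GB/s).
for bw in (0.25, 1.0, 50.0):
    best = select(PROFILES, kv_gb=4.0, bandwidth_gbs=bw, max_quality_loss=1.0)
    print(f"bandwidth={bw} GB/s -> {best.name}")
```

Under these made-up numbers, the heavy profile wins on the slow link, the light profile wins on the medium link, and no compression wins on the fast link, because codec time outweighs the bytes saved. Tightening `max_quality_loss` removes profiles from the feasible set in the same way an SLO or quality budget would.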

Community

CapitalLiu

Paper submitter about 10 hours ago

We’re excited to share that KVServe has been accepted by SIGCOMM 2026 🎉

KVServe is a service-aware KV cache compression framework for communication-efficient disaggregated LLM serving. It treats KV compression as a runtime strategy selection problem rather than a fixed configuration, and achieves up to 9.13× JCT speedup in PD-separated serving and 32.8× TTFT reduction in KV-disaggregated serving. :)
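Editor's illustration: runtime strategy selection with correction of offline estimates can be sketched as a simple bandit. The sketch below uses epsilon-greedy selection seeded with offline latency estimates. That choice is an assumption made for illustration; the page does not say which bandit KVServe uses, and the class name, profile names, and numbers are hypothetical.

```python
import random


class EpsilonGreedySelector:
    """Hypothetical selector: offline estimates act as a one-sample prior."""

    def __init__(self, offline_latency_s, epsilon=0.1, seed=0):
        self.estimate = dict(offline_latency_s)        # running mean latency
        self.count = {name: 1 for name in offline_latency_s}
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose(self):
        # Explore with probability epsilon, otherwise take the current best.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.estimate))
        return min(self.estimate, key=self.estimate.get)

    def update(self, name, observed_latency_s):
        # Incremental mean pulls the estimate toward what is seen online.
        self.count[name] += 1
        n = self.count[name]
        self.estimate[name] += (observed_latency_s - self.estimate[name]) / n


# Offline profiling says "heavy" is fastest, but online it is slower.
selector = EpsilonGreedySelector({"heavy": 1.0, "light": 1.5}, epsilon=0.0)
first = selector.choose()
selector.update(first, observed_latency_s=3.0)
second = selector.choose()
print(first, "->", second)
```

With `epsilon=0.0` the run is deterministic: one online observation that contradicts the offline estimate is enough to move the choice to the other profile. A nonzero `epsilon` keeps occasionally re-testing profiles whose estimates may have gone stale.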

Get this paper in your agent:

hf papers read 2605.13734

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash
