Hugging Face Daily Papers · June 16, 2026 · 3 min read

TokenPilot: Cache-Efficient Context Management for LLM Agents

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

TokenPilot cuts the cost of long-horizon LLM agents by making context shorter without breaking the prompt cache.

arxiv:2606.17016

Published on Jun 15 · Submitted by Ningyu Zhang on Jun 16 · ZJUNLP

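Authors: Buqiang Xu, Zirui Xue, Dianmou Chen, Chenyang Fu, Chiyu Wu, Caiying Huang, Chen Jiang, Jizhan Fang, Xinle Deng, Yijun Chen, Yunzhi Yao, Xuehai Wang, Jin Shang, Gong Yu, Ningyu Zhang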

Abstract

TokenPilot is a dual-granularity context management framework that reduces inference costs in long-horizon LLM sessions by stabilizing prompt prefixes and conservatively managing context segments.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

As LLM agents are deployed in long-horizon sessions, context accumulation drives up inference costs. Existing approaches utilize text pruning or dynamic memory eviction to minimize token footprints; however, their unconstrained sequence mutations alter layouts, introducing prefix mismatches and cache invalidation. This reveals a critical trade-off between text sparsity and prompt cache continuity. To address this, we present TokenPilot, a dual-granularity context management framework. Globally, Ingestion-Aware Compaction acts as a framework harness to stabilize prompt prefixes and eliminate open-world environmental noise at the ingestion gate. Locally, Lifecycle-Aware Eviction monitors the ongoing residual utility of context segments, enforcing a conservative batch-turn schedule to offload content segments only when task relevance expires. Experiments on PinchBench and Claw-Eval under both isolated and continuous modes demonstrate that TokenPilot reduces costs by 61% and 56% in isolated mode, and 61% and 87% in continuous mode, while maintaining competitive performance compared to prior systems. TokenPilot has been integrated into LightMem2 at https://github.com/zjunlp/LightMem2.
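The trade-off the abstract describes, between a shorter context and an intact prompt cache, can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not TokenPilot's implementation: it assumes a prefix cache that reuses only the leading segments that are identical to the previous turn's prompt, it counts whole segments rather than tokens, and the names `shared_prefix_len`, `simulate`, `evict_every`, and `keep_last` are hypothetical.

```python
from typing import List, Tuple


def shared_prefix_len(prev: List[str], curr: List[str]) -> int:
    """Count the leading segments that are identical in both prompts.

    Under the assumed cache model, only this shared prefix can be reused.
    """
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


def simulate(turns: int, evict_every: int, keep_last: int) -> Tuple[int, int]:
    """Return (recomputed_segments, total_segments_sent) for one session.

    Each turn appends one segment to the context. Every `evict_every` turns,
    everything except the system segment and the last `keep_last` segments
    is dropped. Dropping segments shortens the prompt but changes what
    follows the system segment, so the cached prefix after it is lost.
    """
    system = "SYS"
    prev: List[str] = []          # prompt sent on the previous turn
    context: List[str] = [system]
    recomputed = 0
    total_sent = 0
    for t in range(1, turns + 1):
        context.append(f"turn-{t}")
        if t % evict_every == 0 and len(context) > 1 + keep_last:
            context = [system] + context[-keep_last:]
        hit = shared_prefix_len(prev, context)
        recomputed += len(context) - hit   # segments past the cached prefix
        total_sent += len(context)
        prev = list(context)
    return recomputed, total_sent


if __name__ == "__main__":
    for label, every in [("prune every turn", 1),
                         ("batched eviction", 4),
                         ("never evict", 10**9)]:
        recomputed, total_sent = simulate(turns=12, evict_every=every, keep_last=2)
        print(f"{label}: recomputed={recomputed}, total_sent={total_sent}")
```

In this toy model, pruning on every turn keeps the prompt shortest but recomputes the most segments, never evicting recomputes the fewest but sends by far the largest prompts, and a batched schedule sits between the two. That is the tension the abstract's "conservative batch-turn schedule" is aimed at; the actual eviction criteria in the paper depend on task relevance, which this sketch does not model.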

Community

Ningyu (paper submitter) · about 11 hours ago

TokenPilot cuts the cost of long-horizon LLM agents by making context shorter without breaking the prompt cache.

Get this paper in your agent:

hf papers read 2606.17016

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2606.17016 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.17016 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2606.17016 in a Space README.md to link it from this page.

Collections including this paper 1
