r/LocalLLaMA · · 1 min read

Shard - getting to 10× KV cache compression

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a reimplementation of Google's TurboQuant[1], stalled around 4×, and ended up as a different design once we noticed K and V need different treatments: PCA plus int4 quantization on K (the matrix is effectively low-rank once you undo RoPE), and a Hadamard rotation plus vector quantization on V. Attention runs directly on the compressed K, no fp16 reconstruction. Code: krish1905/shard.

submitted by /u/Thrumpwart
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA