Shard - getting to 10× KV cache compression
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
TL;DR. Shard is a drop-in HuggingFace Cache that makes Llama-3.1-8B's KV memory about 10× smaller at 8K context (11× at 32K) without measurable hits to NIAH or LongBench. It started as a reimplementation of Google's TurboQuant[1], stalled around 4×, and ended up as a different design once we noticed K and V need different treatments: PCA plus int4 quantization on K (the matrix is effectively low-rank once you undo RoPE), and a Hadamard rotation plus vector quantization on V. Attention runs directly on the compressed K, no fp16 reconstruction. Code: krish1905/shard.
[link] [comments]
More from r/LocalLLaMA
-
Qwen3.5 35B A3B uncensored heretic Native MTP Preserved is Out Now With the Full 785 MTPs Preserved and Retained, Available in Safetensors, GGUFs. NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats
May 26
-
CXMT started selling ram to corsair
May 26
-
One letter to appease them all
May 26
-
New local model reaching near frontier on PII removal at 9 ms CPU inference
May 26
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.