r/LocalLLaMA

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo


What it is, in plain words. For every token of your conversation, your GPU keeps a key vector and a value vector in every layer of the model. That’s the KV cache, and it’s why long contexts eat VRAM: in float16, Llama-3.1-8B needs about 0.12 MB per token, so 100k tokens cost about 12 GB and a million tokens cost about 122 GB. No consumer card holds that, so when the cache stops fitting, serving stacks quietly delete the oldest tokens. The model isn’t lying when it says “that wasn’t provided.” It really doesn’t have it anymore.
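If you want to check the memory figures yourself, here is a small back-of-the-envelope sketch. The layer count, KV head count, and head size are taken from the published Llama-3.1-8B config rather than from this post, and the helper names are just for illustration. The totals line up with the numbers above when "MB" and "GB" are read as binary units (MiB and GiB).

```python
# Llama-3.1-8B attention shape (from the published model config, not from this post).
NUM_LAYERS = 32
NUM_KV_HEADS = 8      # grouped-query attention: 8 KV heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2   # float16


def kv_bytes_per_token() -> int:
    # Factor 2: one key vector plus one value vector per layer per KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def kv_gib(num_tokens: int) -> float:
    # Total KV cache size in GiB for a context of num_tokens tokens.
    return num_tokens * kv_bytes_per_token() / 2**30


per_token = kv_bytes_per_token()
print(f"per token: {per_token} bytes = {per_token / 2**20:.3f} MiB")
for n in (100_000, 1_000_000):
    print(f"{n:>9,} tokens: {kv_gib(n):6.1f} GiB")
```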
InfiniteKV splits the memory in two, the way a computer does. The most recent 256 tokens stay exact in GPU memory, like hot RAM. Every older token is pressed into a 104-byte record that can live in ordinary RAM or in plain files on your hard disk. The records are searchable: for every new token the model generates, the cache pulls back the most relevant old tokens and the model attends over those plus the recent window. Nothing is ever deleted. At a million tokens that’s roughly 3 GB of records instead of 122 GB of float16, small enough for the machine you already own.
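To make "searchable records" concrete, here is a toy sketch of one way a cold-tier lookup can work: each old key is reduced to a bit signature, and the query pulls back the closest signatures by Hamming distance. This is an illustration under my own assumptions, not the InfiniteKV code. The sign-random-projection scheme, the 64-bit width, and the names `signature`, `hamming`, and `top_k_cold_ids` are all hypothetical, and the sketch says nothing about how the real 104-byte record is laid out.

```python
import random

DIM = 16        # toy key dimension (assumption; real attention heads are larger)
SIG_BITS = 64   # toy signature width (assumption; not the 104-byte record layout)

rng = random.Random(0)
# One random hyperplane per signature bit.
PLANES = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(SIG_BITS)]


def signature(vec):
    """Pack the signs of SIG_BITS random projections into one integer."""
    sig = 0
    for bit, plane in enumerate(PLANES):
        if sum(p * v for p, v in zip(plane, vec)) >= 0.0:
            sig |= 1 << bit
    return sig


def hamming(a, b):
    """Number of bit positions where two signatures differ."""
    return bin(a ^ b).count("1")


def top_k_cold_ids(query, cold_sigs, k):
    """Brute-force scan: positions of the k signatures closest to the query."""
    q = signature(query)
    ranked = sorted(range(len(cold_sigs)), key=lambda i: hamming(q, cold_sigs[i]))
    return ranked[:k]


# Toy cold tier: 200 random keys, kept only as signatures.
keys = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(200)]
cold_sigs = [signature(k) for k in keys]

# A query that is a slightly perturbed copy of key 123.
query = [v + 0.01 * rng.gauss(0.0, 1.0) for v in keys[123]]
hits = top_k_cold_ids(query, cold_sigs, k=4)
print(hits)
```

In a real cache the returned positions would select which old keys and values get attended over alongside the recent window. The sorted scan here is the slow, obvious version of that step.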
The receipts. Everything below is verified, the code that produced it is in the repo, and every result ships as a JSON receipt with hashes and an environment fingerprint, so you can check nothing was edited after the fact. Run the same commands and compare.
• Past the trained window. Mistral-7B (trained to 32,768) answered a buried passkey at token 76,747. Production-style sliding-window serving answered “not provided in the text.” SmolLM2 (trained to 8,192) answered at 12,048 while its unmodified self printed gibberish.
• Not sitting in the recent window. The key was about 38,000 tokens outside it. Cut the cold retrieval and the same model on the same context starts making things up. Restore it and the key comes back. That ablation is a hard assert in the test suite.
• Not secretly in VRAM. Archive mode keeps the cold records in memory-mapped files on disk. You can ls them: 640 MB of files on the drive, 11.5 MB of signatures in VRAM where float16 would have needed 461 MB.
• Reasoning over retrieved memory. Algebra at temperature 0: x is defined 2,700 tokens before the question and the model computes 3x + 5 = 56 from the compressed cache, same as the unmodified model. The word problem is my favorite transcript: the figures (240 sacks a day, 16 per cart) sit 3,000 tokens back, and the model quotes both numbers word for word out of the compressed records, then divides: 15. Verbatim transcripts in the repo.
• Output quality. Full-vocab KL divergence against the unmodified model: median around 0.002, and at 8k context the drift goes down, not up. Top-1 agreement about 0.95. Greedy decode matches token for token in the equivalence gate.
• Weights untouched. SHA-256 over every tensor before enable and after disable. Byte-exact.
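For anyone unsure what the output-quality numbers in the list above measure, here is a dependency-free sketch of full-vocabulary KL divergence and top-1 agreement between two sets of logits. The logits are made-up toy values, the function names are mine, and I'm assuming the direction KL(unmodified || cached). The repo's own benchmark scripts are the authority on how the published figures were computed.

```python
import math
from statistics import median


def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) over the whole vocabulary, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(ref_rows, test_rows):
    """Fraction of positions where both models pick the same top token."""
    same = sum(argmax(r) == argmax(t) for r, t in zip(ref_rows, test_rows))
    return same / len(ref_rows)


# Toy logits: one row per generated position, one column per vocabulary entry.
ref_rows = [[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 1.5, 4.0]]
test_rows = [[1.9, 1.1, 0.0], [0.0, 1.0, 3.0], [1.0, 1.4, 4.1]]

kls = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
print(f"median KL: {median(kls):.4f} nats")
print(f"top-1 agreement: {top1_agreement(ref_rows, test_rows):.2f}")
```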
Why only seven models. Short version: budget. My machine is a Dell Precision laptop with a 16 GB RTX 3080, so I certified everything that fits on it, and rented a RunPod box for the two big ones. Six certified plus the small one I use for the wall test. Instead of getting overwhelmed trying to cover every LLM out there, I’d rather give you solid proof on a few and a clear list of where it’s weak. Also, my own local LLM is running on this cache right now as my daily driver, and in a few weeks I’ll post the real-world benchmark that actually matters: weeks of normal use.
One knob. top_k_cold sets how many cold records come back per generated token. It auto-tunes to the model, landing between 32 and 64, and all the published numbers were measured at those settings. The cache compresses tokens, not facts, and one fact is about a dozen tokens, so the default basket holds a fact comfortably. If your documents are dense with facts (contracts, reference docs, code with many definitions), turn it up to 96 and the basket just gets bigger, for a modest speed cost.
Try it in one click. There’s a Colab badge at the top of the repo. Free T4, about ten minutes: it buries a passkey and retrieves it, measures the KL divergence in front of you, verifies the weights byte for byte, answers from disk files past the trained window, and prints the memory bill. Everything else, including every transcript and benchmark command, is here: <github.com/QLNI/InfiniteKV>
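The byte-for-byte weight check is simple enough to sketch in a few lines: hash every tensor before and after, then compare. This is a minimal stand-in rather than the repo's code. Plain `bytearray` objects play the role of tensor storage here, and `fingerprint` and `changed_tensors` are names made up for the example.

```python
import hashlib


def fingerprint(state):
    """Map each tensor name to the SHA-256 hex digest of its raw bytes."""
    return {name: hashlib.sha256(bytes(raw)).hexdigest() for name, raw in state.items()}


def changed_tensors(before, after):
    """Names whose digest differs, or that exist on only one side."""
    names = sorted(set(before) | set(after))
    return [n for n in names if before.get(n) != after.get(n)]


# Stand-in weights: in practice these would be the raw bytes of each model tensor.
weights = {
    "layer0.q_proj": bytearray(b"\x00\x01\x02\x03"),
    "layer0.k_proj": bytearray(b"\x10\x11\x12\x13"),
}

before = fingerprint(weights)
# ... enable the cache, generate, disable the cache ...
after = fingerprint(weights)
print("byte-exact:", changed_tensors(before, after) == [])

# Negative control: flip one bit and confirm the check notices.
weights["layer0.k_proj"][0] ^= 1
print("tampered:", changed_tensors(before, fingerprint(weights)))
```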
What it isn’t. It’s not perfect, and it won’t be for a while. I’m bolting on a kind of memory these models were never trained to use. The reference implementation is plain PyTorch and slow; if you write a CUDA or C++ kernel for the Hamming scan, I’ll take that PR gladly. Sliding-window models get the hot tier only, and MLA (multi-head latent attention) models need an adapted method I haven’t built.
And yes, I used Claude while building this. It’s a tool and it helped. I know how to write code and I’m not dependent on it. Either way, every number above comes from a test you can run yourself, which is the only part that should matter in my opinion.
Fair warning: the repo is a bit poetic. Hope you don’t judge it on that. I just don’t like my repo looking boring to me, and I had some extra time, so I worked on it.
Hope it helps someone. Thanks.

submitted by /u/Final-Data-1410