[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Probably most of you are aware that using anything other than -ctk q8_0 -ctv q8_0 / -ctk q4_0 -ctv q4_0 as startup options for llama.cpp leads to prompt processing on cpu instead of gpu for cuda at least. E.g. when we use the frequently suggested mix of -ctk q8_0 -ctv q4_0 pps tanks.
I have discussed this with a prop LLM and it suggested to add some slight modifications to the cuda source code of llama.cpp or use cmake -DGGML_CUDA_FA_ALL_QUANTS=ON .. which will take very long.
But coincidentially, user sanmai on github did a small eval and suggested to include the kv cache quant combo during compilation, even without FA_ALL_QUANTS, so that would be great.
Discussion is here, it is worth a read as the eval confirms that using the async 8/4 bit kv quant only costs 1.3% precision while saving more than half of memory compared to f16/f16:
[link] [comments]
More from r/LocalLLaMA
-
How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model)
May 22
-
BeeLlama v0.2.0 – major DFlash update. Single RTX 3090: Qwen 3.6 27B up to 164 tps (4.40x), Gemma 4 31B up to 177.8 tps (4.93x). Prompt processing speed near baseline.
May 22
-
trained a prompt injection detector using ml-intern and DeepSeek v4 Flash, runs in the browser
May 22
-
ByteShape Qwen3.6-35B-A3B: 30% faster than Unsloth IQ on 6GB VRAM laptop
May 22
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.