ICYMI: llama.cpp b9455 -sm tensor KV Cache Fix is MERGED
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Them boys can cook, one big fix after another!
If you're running -sm tensor on multi-GPU, this is the fix that makes it work with a quantized KV cache.
https://github.com/ggml-org/llama.cpp/releases/tag/b9455
JohannesGaessler commented 5 days ago
This PR implements support for the combination of -sm tensor and quantized KV cache. The reason why this doesn't work on master is that the flattening of tensors for the KV cache rotation leads to the loss of shape information, which the meta backend cannot handle. There were previous PRs which resolved the issue by changing the shapes of the KV cache rotation, but that is an undesirable solution because batched matrix multiplications may not be as well-supported in ggml backends as a single large matrix multiplication. Also, it is generally better to extend the meta backend with capabilities to handle a compute graph than to require compute graphs to conform to the meta backend's requirements.
The approach in this PR is to extend the specification ggml_backend_meta_split_state with a value that specifies how often a given segment repeats. When a tensor is flattened, the meta backend uses segments to specify the data layout within the flattened dimension so that, upon a further reshape, the correct data layout can be restored. No changes to the llama.cpp compute graphs are required.
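For anyone wondering what "segments that repeat" means in practice, here is a rough toy model in Python. It is not the actual ggml code: SplitState, flatten, owner_of and unflatten are made-up names, and the assumption that the split is along attention heads with tokens folded into the same flattened dimension is mine, not something stated in the PR. The idea is only that a per-device segment pattern plus a repeat count is enough to describe the flattened layout and to get the original shape back on the next reshape.

```python
from dataclasses import dataclass


@dataclass
class SplitState:
    # Hypothetical stand-in; NOT the real ggml_backend_meta_split_state layout.
    segments: list[int]  # elements owned by each device within one period
    repeats: int         # how often that segment pattern repeats


def flatten(heads_per_device: list[int], n_tokens: int) -> SplitState:
    # Assumed scenario: a [heads, tokens] tensor split along heads is flattened
    # with heads as the fastest-moving index, so the per-device pattern
    # repeats once per token.
    return SplitState(segments=list(heads_per_device), repeats=n_tokens)


def owner_of(state: SplitState, flat_index: int) -> int:
    # Find which device holds a given element of the flattened dimension.
    period = sum(state.segments)
    if not 0 <= flat_index < period * state.repeats:
        raise IndexError(flat_index)
    offset = flat_index % period
    for device, size in enumerate(state.segments):
        if offset < size:
            return device
        offset -= size
    raise AssertionError("unreachable")


def unflatten(state: SplitState) -> tuple[list[int], int]:
    # The segment sizes and repeat count are enough to restore the
    # original (heads per device, tokens) layout on a later reshape.
    return list(state.segments), state.repeats


state = flatten(heads_per_device=[2, 2], n_tokens=3)
owners = [owner_of(state, i) for i in range(12)]
```

With two devices holding two heads each and three tokens, owners alternates between the devices in blocks of two. A plain "first half on device 0, second half on device 1" description of the flattened dimension would get that wrong, which is presumably the kind of shape information that was being lost before.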