ICYMI: llama.cpp b9455 -sm tensor KV Cache Fix is MERGED

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Them boys can cook, one big fix after another!

If you're running -sm tensor on multi-GPU, this is the fix that makes it work with a quantized KV cache.

https://github.com/ggml-org/llama.cpp/releases/tag/b9455

JohannesGaessler commented 5 days ago

This PR implements support for the combination of -sm tensor and quantized KV cache. The reason why this doesn't work on master is that the flattening of tensors for the KV cache rotation leads to the loss of shape information, which the meta backend cannot handle. There were previous PRs which resolved the issue by changing the shapes of the KV cache rotation, but that is an undesirable solution because batched matrix multiplications may not be as well-supported in ggml backends as a single large matrix multiplication. Also, it is generally better to extend the meta backend with capabilities to handle a compute graph than to require compute graphs to conform to the meta backend's requirements.

The approach in this PR is to extend the specification ggml_backend_meta_split_state with a value that specifies how often a given segment repeats. When a tensor is flattened, the meta backend uses segments to specify the data layout within the flattened dimension, so that upon a further reshape the correct data layout can be restored. No changes to the llama.cpp compute graphs are required.
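For anyone trying to picture the "segments plus repeat count" idea, here is a rough toy sketch in Python. It is not the ggml code: `Segment`, `FlatSplit`, `flatten_split`, `owner_of`, and `unflatten` are all made-up names, and it assumes the simplest case of a 2D tensor whose contiguous dimension is split evenly across devices. The point is only that after flattening, the per-device layout becomes a short pattern that repeats once per row, and storing that repeat count is enough to restore the split on a later reshape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    device: int  # hypothetical: index of the device that owns this run
    length: int  # number of contiguous elements in the run


@dataclass(frozen=True)
class FlatSplit:
    segments: tuple  # one period of the layout inside the flattened dim
    repeat: int      # how many times that period repeats


def flatten_split(ne0, ne1, n_dev):
    """Describe a (ne1 rows x ne0 cols) tensor, split evenly along ne0
    across n_dev devices, after it is flattened to ne0 * ne1 elements."""
    assert ne0 % n_dev == 0, "toy sketch assumes an even split"
    per_dev = ne0 // n_dev
    segments = tuple(Segment(d, per_dev) for d in range(n_dev))
    # Each original row contributes one copy of the segment pattern.
    return FlatSplit(segments, repeat=ne1)


def owner_of(split, i):
    """Return the device that owns flat element i."""
    period = sum(s.length for s in split.segments)
    assert 0 <= i < period * split.repeat
    offset = i % period
    for s in split.segments:
        if offset < s.length:
            return s.device
        offset -= s.length


def unflatten(split, ne0):
    """Recover per-device column ranges when reshaping back to rows of ne0."""
    period = sum(s.length for s in split.segments)
    assert period == ne0, "toy sketch only reshapes back to the original row size"
    ranges, start = {}, 0
    for s in split.segments:
        ranges[s.device] = (start, start + s.length)
        start += s.length
    return ranges


split = flatten_split(ne0=8, ne1=3, n_dev=2)
print(split.repeat)           # pattern repeats once per original row
print(owner_of(split, 13))    # flat index 13 is column 5 of row 1
print(unflatten(split, 8))    # column range owned by each device
```

Without the repeat count, the flattened dimension would look like one opaque run of 24 elements and the information about which device owns which columns would be gone, which is roughly the failure the PR description is talking about.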

submitted by /u/Bulky-Priority6824