ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
This PR improves matmul performance for k-quants. The following table shows the improvement on the