ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

This PR improves matmul performance for k-quants. The following table shows the improvement on the pp512 test (prompt processing of 512 tokens, i.e. prefill) on an M2 Pro.

| quant | model | master (t/s) | PR (t/s) | speedup |
| --- | --- | --- | --- | --- |
| Q2_K | qwen3 0.6B Q2_K - Medium | 817.86 ± 6.14 | 1991.81 ± 6.87 | 2.44x |
| Q3_K | qwen35 4B Q3_K - Medium | 92.54 ± 0.13 | 302.24 ± 0.37 | 3.27x |
| Q3_K | gemma4 E4B Q3_K - Medium | 79.06 ± 0.08 | 298.73 ± 0.90 | 3.78x |
| Q4_K | qwen35 4B Q4_K - Medium | 243.82 ± 0.09 | 327.24 ± 0.59 | 1.34x |
| Q4_K | gemma4 E4B Q4_K - Medium | 238.44 ± 0.60 | 324.97 ± 5.74 | 1.36x |
| Q5_K | qwen35 4B Q5_K - Medium | 231.23 ± 0.83 | 307.95 ± 2.93 | 1.33x |
| Q5_K | gemma4 E4B Q5_K - Medium | 229.46 ± 0.87 | 306.12 ± 3.28 | 1.33x |
| Q6_K | qwen35 4B Q6_K | 216.19 ± 0.06 | 311.52 ± 0.05 | 1.44x |
| Q6_K | gemma4 E4B Q6_K | 198.79 ± 3.77 | 303.07 ± 3.28 | 1.52x |

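
For anyone checking the numbers, the speedup column appears to be the PR throughput divided by the master throughput, using the mean t/s values. A minimal Python sketch of that arithmetic, with three rows copied from the table above (the `speedup` helper and the dictionary labels are just illustrative names, not anything from the PR):

```python
# Mean throughput in tokens/s as (master, PR), copied from the table.
results = {
    "Q2_K qwen3 0.6B": (817.86, 1991.81),
    "Q3_K gemma4 E4B": (79.06, 298.73),
    "Q6_K qwen35 4B": (216.19, 311.52),
}


def speedup(master_tps: float, pr_tps: float) -> float:
    """Ratio of PR throughput to master throughput (assumed definition)."""
    return pr_tps / master_tps


for name, (master, pr) in results.items():
    print(f"{name}: {speedup(master, pr):.2f}x")
```

This ignores the ± spread on each measurement, so it only reproduces the rounded ratios and says nothing about run-to-run variance.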
submitted by /u/pmttyji