r/LocalLLaMA · June 9, 2026 · 1 min read

ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp


This PR improves matmul performance for k-quants. The following table shows the prefill improvement on the pp512 test (processing a 512-token prompt) on an M2 Pro, with throughput in tokens per second (t/s).

quant	model	master (t/s)	PR (t/s)	speedup
Q2_K	qwen3 0.6B Q2_K - Medium	817.86 ± 6.14	1991.81 ± 6.87	2.44x
Q3_K	qwen35 4B Q3_K - Medium	92.54 ± 0.13	302.24 ± 0.37	3.27x
	gemma4 E4B Q3_K - Medium	79.06 ± 0.08	298.73 ± 0.90	3.78x
Q4_K	qwen35 4B Q4_K - Medium	243.82 ± 0.09	327.24 ± 0.59	1.34x
	gemma4 E4B Q4_K - Medium	238.44 ± 0.60	324.97 ± 5.74	1.36x
Q5_K	qwen35 4B Q5_K - Medium	231.23 ± 0.83	307.95 ± 2.93	1.33x
	gemma4 E4B Q5_K - Medium	229.46 ± 0.87	306.12 ± 3.28	1.33x
Q6_K	qwen35 4B Q6_K	216.19 ± 0.06	311.52 ± 0.05	1.44x
	gemma4 E4B Q6_K	198.79 ± 3.77	303.07 ± 3.28	1.52x
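
The speedup column looks like the plain ratio of PR throughput to master throughput. As a rough check, here is a minimal Python sketch that recomputes it from the mean values in the table, ignoring the ± spread. The names `ROWS` and `speedup` are made up for this illustration and are not part of the PR or of llama.cpp.

```python
# Mean throughput values copied from the table above:
# (quant, model, master t/s, PR t/s, reported speedup).
# ROWS and speedup() are illustrative names, not from the PR.
ROWS = [
    ("Q2_K", "qwen3 0.6B Q2_K - Medium", 817.86, 1991.81, 2.44),
    ("Q3_K", "qwen35 4B Q3_K - Medium", 92.54, 302.24, 3.27),
    ("Q3_K", "gemma4 E4B Q3_K - Medium", 79.06, 298.73, 3.78),
    ("Q4_K", "qwen35 4B Q4_K - Medium", 243.82, 327.24, 1.34),
    ("Q4_K", "gemma4 E4B Q4_K - Medium", 238.44, 324.97, 1.36),
    ("Q5_K", "qwen35 4B Q5_K - Medium", 231.23, 307.95, 1.33),
    ("Q5_K", "gemma4 E4B Q5_K - Medium", 229.46, 306.12, 1.33),
    ("Q6_K", "qwen35 4B Q6_K", 216.19, 311.52, 1.44),
    ("Q6_K", "gemma4 E4B Q6_K", 198.79, 303.07, 1.52),
]


def speedup(master_tps: float, pr_tps: float) -> float:
    """Return PR throughput divided by master throughput."""
    return pr_tps / master_tps


for quant, model, master, pr, reported in ROWS:
    ratio = speedup(master, pr)
    print(f"{quant:5s} {model:26s} {ratio:.2f}x (reported {reported:.2f}x)")
```

Read this way, the largest gains are on Q2_K and Q3_K (roughly 2.4x to 3.8x), while Q4_K through Q6_K improve by roughly 1.3x to 1.5x.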
