CUDA: add fast walsh-hadamard transform by am17an · Pull Request #23615 · ggml-org/llama.cpp

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

u/am17an implemented a fast Walsh-Hadamard transform (FWHT) for CUDA, which speeds up inference when the KV cache is quantized.
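For anyone unfamiliar with the transform itself, here is a minimal CPU reference sketch of an unnormalized FWHT using the standard in-place butterfly. It is not the CUDA kernel from the PR, and the function name `fwht` is just a placeholder chosen for this illustration.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform (natural/Hadamard ordering).

    Illustrative CPU sketch only; NOT the CUDA kernel from the PR.
    The input length must be a power of two.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    h = 1
    while h < n:
        # Each pass combines elements that are h apart: (a, b) -> (a + b, a - b).
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = [1, 0, 1, 0, 0, 1, 1, 0]
    print(fwht(x))
```

The butterfly takes log2(n) passes over the data, so the cost is O(n log n) instead of the O(n^2) of an explicit Hadamard matrix multiply. Applying the unnormalized transform twice returns the input scaled by n.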

Roughly a 1-2% boost on prompt processing (pp) and a 7-9% boost on token generation (tg).
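As a quick sanity check on how those percentages fall out of the throughput numbers, the sketch below computes the speedup ratio and percentage gain from two t/s values. The sample values are the pp2048 and tg128 figures reported in this post; the helper names `speedup` and `percent_gain` are made up for the example.

```python
def speedup(baseline_tps, new_tps):
    """Ratio of new throughput to baseline throughput (tokens per second)."""
    return new_tps / baseline_tps


def percent_gain(baseline_tps, new_tps):
    """Relative throughput improvement over the baseline, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


if __name__ == "__main__":
    # (master t/s, cuda-fwt t/s) taken from the benchmark numbers in this post.
    rows = {
        "pp2048": (13587.89, 13809.20),
        "tg128": (223.81, 243.90),
    }
    for test, (master, branch) in rows.items():
        ratio = speedup(master, branch)
        gain = percent_gain(master, branch)
        print(f"{test}: {ratio:.2f}x ({gain:.1f}% faster)")
```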

Performance on a 5090 with -ctk q8_0 -ctv q8_0 (K and V cache both quantized to q8_0):

| Model | Test | t/s master | t/s cuda-fwt | Speedup |
| --- | --- | ---: | ---: | ---: |
| gemma4 26B.A4B Q4_K_M | pp2048 | 13587.89 | 13809.20 | 1.02 |
| gemma4 26B.A4B Q4_K_M | pp2048@d1024 | 12425.01 | 12553.32 | 1.01 |
| gemma4 26B.A4B Q4_K_M | pp2048@d2048 | 12158.21 | 12291.42 | 1.01 |
| gemma4 26B.A4B Q4_K_M | pp2048@d4096 | 11710.89 | 11913.97 | 1.02 |
| gemma4 26B.A4B Q4_K_M | pp2048@d8192 | 10982.21 | 11214.12 | 1.02 |
| gemma4 26B.A4B Q4_K_M | pp2048@d16384 | 9702.60 | 9776.75 | 1.01 |
| gemma4 26B.A4B Q4_K_M | tg128 | 223.81 | 243.90 | 1.09 |
| gemma4 26B.A4B Q4_K_M | tg128@d1024 | 210.06 | 228.02 | 1.09 |
| gemma4 26B.A4B Q4_K_M | tg128@d2048 | 217.53 | 235.28 | 1.08 |
| gemma4 26B.A4B Q4_K_M | tg128@d4096 | 216.76 | 234.05 | 1.08 |
| gemma4 26B.A4B Q4_K_M | tg128@d8192 | 209.40 | 226.06 | 1.08 |
| gemma4 26B.A4B Q4_K_M | tg128@d16384 | 204.54 | 219.74 | 1.07 |

submitted by /u/pmttyji
