EvoTensile: Evolutionary algorithms for AMD Tensile GEMM kernel tuning
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
There has been an effort to tune kernels in hipBLASLt so the most basic matmuls can run faster. It's known that on Strix Halo (gfx1151), GEMM with NN and TN input layouts (used in inference) are already well-tuned, while NT and TT layouts (used in training) are not yet tuned.
The tool we use to tune the kernels is named Tensile (to be specific, it's TensileLite, not the original Tensile). It can generate a kernel from many tunable parameters. The remaining problem is to search for the best parameters that generate the fastest kernel for each input shape, and do it on various input shapes. There are some surrogates such as Formocast and Origami that may help the search, but they cannot yet predict the performance of gfx1151.
I've created EvoTensile that does the search with evolutionary algorithms, and it seems to work. I've tuned the NT layout on 100 input shapes. The speed is improved like from 20 to 40 TFLOPS. Compared to the theoretical roofline of 59.4 TFLOPS, I think 40 TFLOPS is good enough.
EvoTensile repo: https://github.com/woct0rdho/evotensile
My forked rocm-libraries: https://github.com/woct0rdho/rocm-libraries . You can build it and test the speedup.
My previous issue tracking the performance: https://github.com/ROCm/TheRock/issues/5314
I'm going to tune it on a larger grid of input shapes. If some AMD developers see this, I hope you can do some more extensive verifications of correctness and performance for the tuned configs, so eventually we can merge it into the mainstream rocm-libraries.
I knew what you might ask: No, it cannot yet tune the fused dequant-matmul kernels in llama.cpp . Maybe we can write them in some Tensile-like primitives and tune them.
[link] [comments]
More from r/LocalLLaMA
-
Been running Qwen3.6-27B through a 3-critic harness. The harness matters more than I thought
Jun 30
-
I Hate Dario Amodei, and everything he stands for.
Jun 29
-
Introducing LongCat-2.0 - , a large-scale MoE language model with 1.6 trillion total parameters and ~48 billion activated per token. This was the stealth model that was on Openrouter under the name 'owl-alpha'.
Jun 29
-
Krea-2-Turbo Image Model - Easy to be fully uncensored, but it can also EDIT Images!
Jun 29
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.