r/LocalLLaMA · June 19, 2026 · 1 min read

EvoTensile: Evolutionary algorithms for AMD Tensile GEMM kernel tuning

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

There has been an effort to tune kernels in hipBLASLt so the most basic matmuls can run faster. It's known that on Strix Halo (gfx1151), GEMM with NN and TN input layouts (used in inference) are already well-tuned, while NT and TT layouts (used in training) are not yet tuned.

The tool we use to tune the kernels is named Tensile (to be specific, it's TensileLite, not the original Tensile). It can generate a kernel from many tunable parameters. The remaining problem is to search for the best parameters that generate the fastest kernel for each input shape, and do it on various input shapes. There are some surrogates such as Formocast and Origami that may help the search, but they cannot yet predict the performance of gfx1151.

I've created EvoTensile that does the search with evolutionary algorithms, and it seems to work. I've tuned the NT layout on 100 input shapes. The speed is improved like from 20 to 40 TFLOPS. Compared to the theoretical roofline of 59.4 TFLOPS, I think 40 TFLOPS is good enough.

EvoTensile repo: https://github.com/woct0rdho/evotensile

My forked rocm-libraries: https://github.com/woct0rdho/rocm-libraries . You can build it and test the speedup.

My previous issue tracking the performance: https://github.com/ROCm/TheRock/issues/5314

I'm going to tune it on a larger grid of input shapes. If some AMD developers see this, I hope you can do some more extensive verifications of correctness and performance for the tuned configs, so eventually we can merge it into the mainstream rocm-libraries.

I knew what you might ask: No, it cannot yet tune the fused dequant-matmul kernels in llama.cpp . Maybe we can write them in some Tensile-like primitives and tune them.

submitted by /u/woct0rdho
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

Discussion (0)

More from r/LocalLLaMA