hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

A few weeks ago, after finishing FastDMS, I started toying around with writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple of weeks, I turned those experiments into hipEngine, a new open source (AGPLv3) ROCm-native local LLM inference engine.

It's Python-based, but with no heavy PyTorch dependency. The entire hot path is HIP/C++, making liberal use of AMD-native libs like hipBLASLt, hipGraph, AOTriton, etc.

gfx1100 (Radeon RX 7900 XTX / Radeon Pro W7900)

The initial implementation has Qwen 3.6 (MoE and dense) running competitively with llama.cpp. The ParoQuant 4.68bpw path (I've also ported ParoQuant to be ROCm compatible) has better c=1 prefill ("prompt processing") than llama.cpp at every tested context length, from 512 to 128K, on gfx1100 (W7900/7900 XTX):

Prefill tok/s

| Workload | hipEngine PARO | hipEngine GGUF Q4_K_S | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|---|
| 512/128 | 2718.497 | 2258.847 | 2436.049 | 1816.927 |
| 4K/128 | 2838.773 | 2576.673 | 2176.905 | 1705.093 |
| 32K/128 | 2074.699 | 1893.967 | 1496.409 | 1128.554 |
| 128K/128 | 1055.454 | 998.143 | 710.213 | 480.539 |
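
To read the prefill table in relative terms, the sketch below divides the hipEngine PARO column by the llama.cpp HIP column for each workload. The numbers are copied from the table; the dictionary layout and the `speedup` helper are illustrative names, not part of hipEngine.

```python
# Prefill tok/s copied from the gfx1100 table (hipEngine PARO vs llama.cpp HIP).
prefill = {
    "512/128": {"paro": 2718.497, "llama_hip": 2436.049},
    "4K/128": {"paro": 2838.773, "llama_hip": 2176.905},
    "32K/128": {"paro": 2074.699, "llama_hip": 1496.409},
    "128K/128": {"paro": 1055.454, "llama_hip": 710.213},
}


def speedup(row):
    """Ratio of hipEngine PARO prefill throughput to llama.cpp HIP throughput."""
    return row["paro"] / row["llama_hip"]


# Rounded to two decimals for display.
speedups = {workload: round(speedup(row), 2) for workload, row in prefill.items()}

for workload, ratio in speedups.items():
    print(f"{workload}: {ratio:.2f}x")
```

On these figures the gap widens with context length, from roughly 1.1x at 512 tokens to roughly 1.5x at 128K.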

Decode tok/s

| Workload | hipEngine PARO | hipEngine GGUF Q4_K_S | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|---|
| 512/128 | 103.460 | 109.152 | 85.487 | 127.515 |
| 4K/128 | 101.964 | 100.048 | 87.375 | 120.163 |
| 32K/128 | 90.438 | 86.774 | 76.994 | 98.073 |
| 128K/128 | 59.598 | 57.954 | 57.341 | 64.478 |

Peak GiB

| Workload | hipEngine PARO | hipEngine GGUF Q4_K_S | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|---|
| 512/128 | 20.962 | 25.108 | 21.125 | 20.844 |
| 4K/128 | 21.906 | 25.108 | 21.197 | 20.969 |
| 32K/128 | 22.016 | 25.108 | 21.738 | 21.533 |
| 128K/128 | 22.122 | 25.108 | 23.605 | 23.596 |

The ParoQuant path also has the lowest peak memory usage at 128K. hipEngine also has a near-lossless INT8 KV cache (with almost no speed loss), meaning that you can run the full Qwen 3.6 256K context window in <24GB (e.g., on a dedicated 7900 XTX) at good performance on RDNA3:

| Model | Context | KV cache | Sampled peak | Allocator peak | Retained KV | Prefill | Decode |
|---|---|---|---|---|---|---|---|
| Qwen3.6 35B-A3B PARO | 128K | BF16 | 21.04 GiB | 21.88 GiB | 2.69 GiB | 1091.9 tok/s | 62.2 tok/s |
| Qwen3.6 35B-A3B PARO | 128K | INT8 | 19.80 GiB | 20.89 GiB | 1.36 GiB | 1076.5 tok/s | 60.0 tok/s |
| Qwen3.6 35B-A3B PARO | 256K | INT8 | 21.96 GiB | 23.71 GiB | 2.71 GiB | 670.2 tok/s | 40.3 tok/s |
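
The "Retained KV" column can be turned into an approximate per-token cost with a little arithmetic. The sketch below makes two assumptions that the post does not state: that 128K and 256K mean 131072 and 262144 tokens, and that retained KV scales linearly with context (any fixed-size state is ignored). The helper name is illustrative.

```python
GIB = 1024 ** 3

# (context tokens, KV dtype, retained KV in GiB), copied from the table.
# Assumption: "128K" = 128 * 1024 tokens and "256K" = 256 * 1024 tokens.
rows = [
    (128 * 1024, "BF16", 2.69),
    (128 * 1024, "INT8", 1.36),
    (256 * 1024, "INT8", 2.71),
]


def kv_bytes_per_token(context_tokens, retained_gib):
    """Approximate retained KV bytes per token, assuming linear scaling."""
    return retained_gib * GIB / context_tokens


per_token = {
    (ctx, dtype): kv_bytes_per_token(ctx, gib) for ctx, dtype, gib in rows
}

for (ctx, dtype), nbytes in per_token.items():
    print(f"{ctx // 1024}K {dtype}: {nbytes / 1024:.1f} KiB/token")

# INT8 should cost about half of BF16 per token, plus a little for scales.
int8_vs_bf16 = per_token[(128 * 1024, "INT8")] / per_token[(128 * 1024, "BF16")]
print(f"INT8 / BF16 per-token ratio: {int8_vs_bf16:.3f}")
```

Under those assumptions, INT8 retains roughly half the bytes per token that BF16 does, which is why 256K at INT8 lands at about the same retained KV as 128K at BF16.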

gfx1151 (AMD Ryzen AI MAX+ 395 / Radeon 8060S)

I currently don't have a dedicated Strix Halo machine for grinding kernels on, but I'm happy to say that with only minimal targeted optimization, it is already quite fast on gfx1151:

Prefill tok/s

| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|
| 512/128 | 983.206 | 1058.738 | 638.008 |
| 4K/128 | 1029.402 | 1004.220 | 595.400 |
| 32K/128 | 792.296 | 735.534 | 407.984 |
| 128K/128 | 413.489 | 376.070 | 181.453 |

Decode tok/s

| Workload | hipEngine PARO | llama.cpp HIP | llama.cpp Vulkan |
|---|---|---|---|
| 512/128 | 62.060 | 50.537 | 57.615 |
| 4K/128 | 63.605 | 49.379 | 55.027 |
| 32K/128 | 50.629 | 43.435 | 44.576 |
| 128K/128 | 30.245 | 31.286 | 26.935 |

GGUF

One thing you might notice in the gfx1100 tables is that hipEngine also now has initial support for GGUF. This is something that I figured would be easy to add (not quite: it took a few more days, and billions more cached agentic coding tokens humming in the background, than I would have expected), but I got Q4_K_M and Q4_K_S into a "good enough" initial state - a little behind the ParoQuant path in speed, but it does open up future compatibility and does not require any custom training (ParoQuant models can take days to quant).
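
For readers unfamiliar with the container format, the first step of any GGUF loader is reading the fixed-size file header. The sketch below follows the public GGUF layout for version 2 and later (4-byte magic, little-endian uint32 version, then uint64 tensor and metadata counts). It is not taken from hipEngine's loader, and the function name and synthetic header bytes are illustrative.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"


def read_gguf_header(stream):
    """Read the fixed-size GGUF header (v2+ layout) from a binary stream."""
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # Little-endian uint32 version, then two little-endian uint64 counts.
    (version,) = struct.unpack("<I", stream.read(4))
    tensor_count, metadata_kv_count = struct.unpack("<QQ", stream.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


# Synthetic header for demonstration only; with a real model you would pass
# a file object opened in binary mode instead.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5))
header = read_gguf_header(fake_file)
print(header)
```

The metadata key/value pairs and tensor descriptors follow the header; decoding the Q4_K block layouts themselves is where the real work in a kernel path lies.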

Implementation Notes

hipEngine was put together mostly as a fun sidequest/experiment, but, inspired by DS4, it seemed useful enough to package up and share with any RDNA3 users. It's designed to allow expansion to different model architectures (maybe Gemma 4 or StepFun 3.5 next), and to different hardware as well.

I've also shared some docs/ in the repo for those interested:

  • KERNELS.md - the list of 100+ custom kernels, with both fused and unfused variants (and a CPU-reference oracle) for correctness checking
  • ROOFLINE.md and ROOFLINE-gfx1151.md - for AMD GPU nerds, this is part of why I decided to go down this path, since there's so much theoretical performance on the table, although even after reducing kernel launches, and many iterations, it turns out that
  • LESSONS-LEARNED.md - some notes on what worked and didn't work while optimizing.
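
As a rough illustration of the "theoretical performance on the table" point, here is a back-of-the-envelope bandwidth bound for decode. It is not the analysis in ROOFLINE.md. The inputs are assumptions: about 3B active parameters per token (read off the "A3B" in the model name), 4.68 bits per weight from the ParoQuant figure above, and 960 GB/s, which is the published peak memory bandwidth of the 7900 XTX (substitute your own card's figure). It ignores KV cache reads, activations, and launch overhead, so it is a loose ceiling only.

```python
def decode_tok_s_upper_bound(active_params, bits_per_weight, bandwidth_gb_s):
    """Weight-streaming ceiling: every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed inputs (see the paragraph above); none are measured here.
ACTIVE_PARAMS = 3e9        # approximation taken from "A3B"
BITS_PER_WEIGHT = 4.68     # ParoQuant bpw quoted in the post
BANDWIDTH_GB_S = 960.0     # 7900 XTX published peak, an assumption for this run

bound = decode_tok_s_upper_bound(ACTIVE_PARAMS, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
measured = 103.460         # hipEngine PARO decode tok/s at 512/128 on gfx1100

print(f"weight-streaming ceiling: {bound:.0f} tok/s")
print(f"measured / ceiling: {measured / bound:.0%}")
```

With these inputs the measured decode rate is well under the ceiling, which is consistent with the post's remark that plenty of headroom remains, though a real roofline has to account for everything this sketch leaves out.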

I'd encourage anyone with an interest/inkling to poke around, review the docs, generate their own code/optimizations, etc., but a couple of notes about the hipEngine codebase in particular. First, hipEngine is AGPLv3 licensed, which is a strong copyleft license. Anyone is free to use and modify it however they want, but if you redistribute any part of it, you must share alike.

Also, while this post was entirely typed by hand into a textbox, the kernel optimization is the result of hundreds (thousands?) of rounds of AI-assisted generation and is not suitable for use/adoption by code-bases with strict anti-AI policies.

NOTE: this is very early code. All the numerics have been very carefully tested and the model inferences well for me, but if you're trying to install this, you might want to use an AI agent to help if you run into HIP/ROCm problems.

submitted by /u/randomfoo2