r/LocalLLaMA · May 24, 2026 · 4 min read

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX)

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

A few weeks ago, after finishing FastDMS, I started toying around with writing some RDNA3 kernels again to see how fast I could get Qwen 3.6 MoE running. It turned out well enough, so over the past couple of weeks, I turned those experiments into hipEngine, a new open source (AGPLv3) ROCm-native local LLM inference engine.

It's Python-based, but with no heavy PyTorch dependency. The entire hot path is HIP/C++, making liberal use of AMD-native libs like hipBLASLt, hipGraph, AOTriton, etc.

gfx1100 (Radeon RX 7900 XTX / Radeon Pro W7900)

The initial implementation has Qwen 3.6 (MoE and dense) running competitively with llama.cpp. The 4.68bpw ParoQuant path (I've also ported ParoQuant to be ROCm compatible) has better c=1 prefill ("prompt processing") than llama.cpp at every tested context length, from 512 to 128K, on gfx1100 (W7900/7900 XTX):

Prefill tok/s

Workload	hipEngine PARO	hipEngine GGUF Q4_K_S	llama.cpp HIP	llama.cpp Vulkan
512/128	2718.497	2258.847	2436.049	1816.927
4K/128	2838.773	2576.673	2176.905	1705.093
32K/128	2074.699	1893.967	1496.409	1128.554
128K/128	1055.454	998.143	710.213	480.539
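The "better at every tested context length" claim can be checked directly against the prefill table. The snippet below is only a small sketch over the numbers copied from that table; the helper names (`speedups`, `wins_everywhere`) are made up for illustration and are not part of hipEngine.

```python
# Prefill tok/s copied from the gfx1100 table above (c=1).
PREFILL = {
    "512/128": {"hipEngine PARO": 2718.497, "hipEngine GGUF Q4_K_S": 2258.847,
                "llama.cpp HIP": 2436.049, "llama.cpp Vulkan": 1816.927},
    "4K/128": {"hipEngine PARO": 2838.773, "hipEngine GGUF Q4_K_S": 2576.673,
               "llama.cpp HIP": 2176.905, "llama.cpp Vulkan": 1705.093},
    "32K/128": {"hipEngine PARO": 2074.699, "hipEngine GGUF Q4_K_S": 1893.967,
                "llama.cpp HIP": 1496.409, "llama.cpp Vulkan": 1128.554},
    "128K/128": {"hipEngine PARO": 1055.454, "hipEngine GGUF Q4_K_S": 998.143,
                 "llama.cpp HIP": 710.213, "llama.cpp Vulkan": 480.539},
}


def speedups(table, candidate, baseline):
    """Per-workload throughput ratio of candidate over baseline."""
    return {w: row[candidate] / row[baseline] for w, row in table.items()}


def wins_everywhere(table, candidate):
    """True if candidate has the highest tok/s in every row of the table."""
    return all(row[candidate] == max(row.values()) for row in table.values())


if __name__ == "__main__":
    for workload, ratio in speedups(PREFILL, "hipEngine PARO", "llama.cpp HIP").items():
        print(f"{workload:>9}: {ratio:.2f}x vs llama.cpp HIP")
    print("PARO fastest in every row:", wins_everywhere(PREFILL, "hipEngine PARO"))
```

On these numbers the PARO advantage over llama.cpp HIP grows with context, from roughly 1.12x at 512 to roughly 1.49x at 128K, while the GGUF Q4_K_S path is still behind llama.cpp HIP at 512.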

Decode tok/s

Workload	hipEngine PARO	hipEngine GGUF Q4_K_S	llama.cpp HIP	llama.cpp Vulkan
512/128	103.460	109.152	85.487	127.515
4K/128	101.964	100.048	87.375	120.163
32K/128	90.438	86.774	76.994	98.073
128K/128	59.598	57.954	57.341	64.478

Peak GiB

Workload	hipEngine PARO	hipEngine GGUF Q4_K_S	llama.cpp HIP	llama.cpp Vulkan
512/128	20.962	25.108	21.125	20.844
4K/128	21.906	25.108	21.197	20.969
32K/128	22.016	25.108	21.738	21.533
128K/128	22.122	25.108	23.605	23.596

The PARO path also has the lowest peak memory usage at 128K. hipEngine also has a near-lossless INT8 KV cache (with almost no speed loss), meaning that you can run the full Qwen 3.6 256K context window in <24GB (e.g., on a dedicated 7900 XTX) at good performance on RDNA3:

Model	Context	KV cache	Sampled peak	Allocator peak	Retained KV	Prefill	Decode
Qwen3.6 35B-A3B PARO	128K	BF16	21.04 GiB	21.88 GiB	2.69 GiB	1091.9 tok/s	62.2 tok/s
Qwen3.6 35B-A3B PARO	128K	INT8	19.80 GiB	20.89 GiB	1.36 GiB	1076.5 tok/s	60.0 tok/s
Qwen3.6 35B-A3B PARO	256K	INT8	21.96 GiB	23.71 GiB	2.71 GiB	670.2 tok/s	40.3 tok/s
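To make the KV cache numbers a bit more concrete, here is a rough back-of-the-envelope sketch using only the "Retained KV" column above. It assumes "128K" means 128 * 1024 tokens, and the function names are hypothetical; nothing here reflects how hipEngine actually lays out its cache.

```python
GIB = 1024 ** 3
K = 1024  # assumption: "128K" context means 128 * 1024 tokens

# "Retained KV" values (GiB) copied from the table above,
# keyed by (KV cache dtype, context length in K tokens).
RETAINED_KV_GIB = {
    ("BF16", 128): 2.69,
    ("INT8", 128): 1.36,
    ("INT8", 256): 2.71,
}


def kv_bytes_per_token(kv_gib, context_k):
    """Average retained KV bytes per token of context."""
    return kv_gib * GIB / (context_k * K)


def int8_to_bf16_ratio():
    """INT8 retained KV as a fraction of BF16 retained KV at 128K."""
    return RETAINED_KV_GIB[("INT8", 128)] / RETAINED_KV_GIB[("BF16", 128)]


if __name__ == "__main__":
    for (dtype, ctx), gib in RETAINED_KV_GIB.items():
        per_tok = kv_bytes_per_token(gib, ctx)
        print(f"{dtype} @ {ctx}K: {per_tok / 1024:.1f} KiB per token")
    print(f"INT8/BF16 at 128K: {int8_to_bf16_ratio():.3f}")
```

The INT8 cache at 128K comes out at about 51% of the BF16 one, slightly more than half. That would be consistent with int8 payloads plus a small amount of scale metadata, though the post doesn't describe the layout. It also shows why 256K fits: doubling the context at INT8 lands at 2.71 GiB, roughly the same retained KV as BF16 at 128K (2.69 GiB).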

gfx1151 (AMD Ryzen AI MAX+ 395 / Radeon 8060S)

I currently don't have a dedicated Strix Halo machine for grinding kernels on, but I'm happy to say that with only minimal targeted optimization, it is already quite fast on gfx1151:

Prefill tok/s

Workload	hipEngine PARO	llama.cpp HIP	llama.cpp Vulkan
512/128	983.206	1058.738	638.008
4K/128	1029.402	1004.220	595.400
32K/128	792.296	735.534	407.984
128K/128	413.489	376.070	181.453

Decode tok/s

Workload	hipEngine PARO	llama.cpp HIP	llama.cpp Vulkan
512/128	62.060	50.537	57.615
4K/128	63.605	49.379	55.027
32K/128	50.629	43.435	44.576
128K/128	30.245	31.286	26.935
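Since prefill and decode numbers pull in different directions at some context lengths, it can help to fold them into a single wall-clock estimate per workload. The sketch below assumes a label like "32K/128" means 32 * 1024 prompt tokens followed by 128 generated tokens, and that the reported tok/s are averages over each phase; both are my reading of the tables, and the helper names are hypothetical.

```python
# gfx1151 numbers copied from the tables above: (prefill tok/s, decode tok/s).
GFX1151 = {
    "hipEngine PARO": {
        "512/128": (983.206, 62.060),
        "4K/128": (1029.402, 63.605),
        "32K/128": (792.296, 50.629),
        "128K/128": (413.489, 30.245),
    },
    "llama.cpp HIP": {
        "512/128": (1058.738, 50.537),
        "4K/128": (1004.220, 49.379),
        "32K/128": (735.534, 43.435),
        "128K/128": (376.070, 31.286),
    },
}


def parse_workload(label):
    """Turn a label like '32K/128' into (prompt_tokens, generated_tokens)."""
    def tokens(text):
        return int(text[:-1]) * 1024 if text.endswith("K") else int(text)

    prompt, generated = label.split("/")
    return tokens(prompt), tokens(generated)


def total_seconds(label, prefill_tps, decode_tps):
    """Estimated wall-clock time: prompt / prefill rate + generated / decode rate."""
    prompt, generated = parse_workload(label)
    return prompt / prefill_tps + generated / decode_tps


if __name__ == "__main__":
    for engine, rows in GFX1151.items():
        for label, (prefill_tps, decode_tps) in rows.items():
            secs = total_seconds(label, prefill_tps, decode_tps)
            print(f"{engine:>15} {label:>9}: {secs:7.1f} s")
```

Under those assumptions, the 32K/128 workload takes roughly 44 s on hipEngine PARO versus roughly 47.5 s on llama.cpp HIP, and almost all of that time is prefill. With only 128 generated tokens, the decode rate barely moves the total at long contexts.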

GGUF

One thing you might notice in the gfx1100 tables is that hipEngine also now has initial support for GGUF. This is something that I figured would be easy to add (not quite: it took a few more days, and billions of cached agentic coding tokens humming in the background, than I would have expected), but I got Q4_K_M and Q4_K_S into a "good enough" initial state - a little behind the ParoQuant path in speed, but it does open up future compatibility and does not require any custom training (ParoQuant models can take days to quant).

Implementation Notes

hipEngine started mostly as a fun sidequest/experiment, but inspired by DS4, it seemed useful enough to package up and share with any RDNA3 users. It's designed to allow expansion to different model architectures (maybe Gemma 4 or StepFun 3.5 next), and to different hardware as well.

I've also shared some docs/ in the repo for those interested:

KERNELS.md - the list of 100+ custom kernels, covering both fused and unfused variants (plus a CPU-reference oracle) for correctness checking
ROOFLINE.md and ROOFLINE-gfx1151.md - for AMD GPU nerds, this is part of why I decided to go down this path, since there's so much theoretical performance on the table, although even after reducing kernel launches and many iterations, it turns out that
LESSONS-LEARNED.md - some notes on what worked and didn't work while optimizing.

I'd encourage anyone with an interest/inkling to poke around, review the docs, generate their own code/optimizations, etc., but a couple of notes on the hipEngine codebase in particular: hipEngine is AGPLv3 licensed, which is a strong copyleft license. Anyone is free to use and modify it however they want, but if you redistribute any part of it, you must share alike.

Also, while this post was entirely typed by hand into a textbox, the kernel optimization is the result of hundreds (thousands?) of rounds of AI-assisted generation and is not suitable for use/adoption by code-bases with strict anti-AI policies.

NOTE: this is very early code - all the numerics have been very carefully tested and the model inferences well for me, but if you're trying to install this, you might want to use an AI agent to help if you run into HIP/ROCm problems.

submitted by /u/randomfoo2