r/LocalLLaMA

vLLM PR adding native HIP W4A16 kernel was merged

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

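The performance increase from this PR is substantial: at max-num-seqs=8 the new kernel is roughly 2.5x (bf16) to 3.2x (fp16) faster than the Triton W4A16 kernel. Makes my ROCm rig a lot more useful.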

Numbers from the PR:

| Kernel | dtype | max-num-seqs=8 | max-num-seqs=32 |
|---|---|---|---|
| Triton W4A16 | bf16 | 82.4 tk/s | - |
| Triton W4A16 | fp16 | 83.2 tk/s | - |
| ExLlama (no bf16) | fp16 | 255.0 tk/s | 382.5 tk/s |
| RDNA3 W4A16 (this PR) | bf16 | 205.3 tk/s | 382.5 tk/s |
| RDNA3 W4A16 (this PR) | fp16 | 270.2 tk/s | 445.7 tk/s |

EDIT: The numbers are for Qwen3.6-27B-GPTQ-W4A16-G32.
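
For anyone who wants the relative gains instead of raw throughput, here is a quick sketch that computes speedup ratios from the numbers above. The `results` dict and the `speedup` helper are just illustrative names, not anything from the PR; the values are copied from the table, with `None` where no measurement was reported.

```python
# Throughput in tokens/s, copied from the table in the post.
# Each value is (max-num-seqs=8, max-num-seqs=32); None marks a missing measurement.
results = {
    ("Triton W4A16", "bf16"): (82.4, None),
    ("Triton W4A16", "fp16"): (83.2, None),
    ("ExLlama", "fp16"): (255.0, 382.5),
    ("RDNA3 W4A16", "bf16"): (205.3, 382.5),
    ("RDNA3 W4A16", "fp16"): (270.2, 445.7),
}


def speedup(new, baseline, seqs_index):
    """Return the new/baseline throughput ratio, or None if either value is missing."""
    new_value = results[new][seqs_index]
    base_value = results[baseline][seqs_index]
    if new_value is None or base_value is None:
        return None
    return new_value / base_value


if __name__ == "__main__":
    # New kernel vs. the Triton kernel (only measured at max-num-seqs=8).
    for dtype in ("bf16", "fp16"):
        ratio = speedup(("RDNA3 W4A16", dtype), ("Triton W4A16", dtype), 0)
        print(f"RDNA3 vs Triton, {dtype}, max-num-seqs=8: {ratio:.2f}x")

    # New kernel vs. ExLlama, fp16 only since ExLlama has no bf16 path.
    for index, seqs in enumerate((8, 32)):
        ratio = speedup(("RDNA3 W4A16", "fp16"), ("ExLlama", "fp16"), index)
        print(f"RDNA3 vs ExLlama, fp16, max-num-seqs={seqs}: {ratio:.2f}x")
```

The comparison against ExLlama is the more interesting one, since that was already the fast fp16 path; the Triton comparison mostly shows how much bf16 users gain.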

See more here: PR link

submitted by /u/StupidityCanFly