r/LocalLLaMA · 3 min read

Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP



Hey fellow Llamas, keeping it short.

We just shipped DFlash and PFlash support for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). Same Luce DFlash stack from the RTX 3090 post a couple weeks back, now running on the consumer AMD APU class.

Repo: https://github.com/Luce-Org/lucebox-hub (MIT)

TL;DR

End-to-end on Qwen3.6-27B Q4_K_M with the Luce Q8_0 DFlash drafter: 26.85 tok/s decode and 20.2 s prefill at 16K context.

That is 2.23x faster decode and 3.05x faster prefill than llama.cpp HIP on the same silicon. At a 16K prompt + 1K generation workload, total wall clock drops from 147 s to 58 s, 2.5x faster end to end.
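The end-to-end claim follows directly from the per-phase numbers. A quick back-of-envelope check (toy arithmetic using only the prefill seconds and decode tok/s quoted in this post):

```python
# Sanity-check the end-to-end wall-clock claim from the per-phase numbers
# quoted above: 16K prefill + 1024 generated tokens.
PREFILL_S = {"llama.cpp HIP AR": 61.69, "Luce PFlash": 20.2}   # seconds at 16K
DECODE_TPS = {"llama.cpp HIP AR": 12.02, "Luce DFlash": 26.85}  # tok/s
GEN_TOKENS = 1024

baseline = PREFILL_S["llama.cpp HIP AR"] + GEN_TOKENS / DECODE_TPS["llama.cpp HIP AR"]
luce = PREFILL_S["Luce PFlash"] + GEN_TOKENS / DECODE_TPS["Luce DFlash"]
print(f"baseline ~{baseline:.0f} s, Luce ~{luce:.0f} s, {baseline / luce:.1f}x end to end")
# → baseline ~147 s, Luce ~58 s, 2.5x end to end
```

The 147 s and 58 s figures in the post are exactly prefill time plus generation time at the quoted rates.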

The same 128 GiB box hosts checkpoints up to ~100 GiB, a class of models a 24 GiB consumer GPU cannot touch (Qwen3.5-122B-A10B, MiniMax-M2.7-REAP 139B-A10B, full BF16 27B).

The numbers

Hardware: Ryzen AI MAX+ 395, Radeon 8060S iGPU (gfx1151), 128 GiB LPDDR5X-8000, ROCm 7.2.2
Target: Qwen3.6-27B Q4_K_M (15.65 GiB)
Drafter: Lucebox/Qwen3.6-27B-DFlash-GGUF Q8_0 with DFLASH27B_DRAFT_SWA=2048
Bench: 10-prompt HumanEval-style, --n-gen 128 --ddtree-budget 22 --fast-rollback

Decode (Qwen3.6-27B Q4_K_M, tok/s):

Engine tok/s vs AR
llama.cpp HIP AR 12.02 1.00x
llama.cpp Vulkan AR 12.45 1.04x
Luce DFlash (this PR) 26.85 2.23x

Prefill (Qwen3.6-27B, 16K tokens):

Engine TTFT vs AR
llama.cpp HIP AR 61.69 s 1.00x
Luce PFlash 20.2 s 3.05x

Speedup grows with context: PFlash compress is O(S) while AR prefill is O(S^2). Needle-in-a-haystack (NIAH) retrieval still passes at 16K.
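To see why the gap widens, here is a toy extrapolation calibrated to the 16K numbers above. It assumes AR prefill is purely quadratic and PFlash purely linear, which overstates the long-context gap (real AR prefill has a large linear FFN/bandwidth component), so treat it as an upper bound on the trend, not a measurement:

```python
# Toy scaling model: AR prefill ~ b*S^2, PFlash ~ c*S, with b and c
# calibrated to the measured 16K numbers from the table above.
# Illustrative upper bound only; real AR curves flatten at long context.
S0 = 16_384
AR_16K, PFLASH_16K = 61.69, 20.2   # seconds at 16K, from the table
b = AR_16K / S0**2                  # quadratic coefficient
c = PFLASH_16K / S0                 # linear coefficient

for s in (16_384, 32_768, 65_536, 131_072):
    ar, pf = b * s**2, c * s
    print(f"{s // 1024:>4}K: AR ~{ar:7.1f} s, PFlash ~{pf:6.1f} s, {ar / pf:4.1f}x")
```

Under this model the speedup ratio grows linearly with S (3.05x at 16K, ~6.1x at 32K), which is the shape behind "speedup grows with context."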

Tuning note: --ddtree-budget=22 is the optimum on gfx1151. Higher budgets accept more tokens per step, but each step gets more expensive on LPDDR5X; memory bandwidth caps the benefit before better tile utilization pays off. Contrast with gfx1100 (7900 XTX, GDDR6 at 936 GB/s), where budget=8 wins because tile waste matters more than launch amortization. The shipped defaults are arch-aware.
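The trade-off has a simple shape: speculative decoding throughput is roughly (expected accepted tokens + 1) / step time, and step time grows with the draft budget. A minimal sketch of that curve; every constant here (acceptance rate, step costs) is made up for illustration and is not a measurement of Luce or of this hardware:

```python
# Toy model of the draft-tree budget trade-off in speculative decoding.
# Larger budgets accept more tokens per step, but each verify step reads
# more weights/KV, which is what bandwidth-limited memory punishes.
# All constants are illustrative, not measured.
def throughput(budget, accept_rate=0.7, step_base_ms=30.0, per_node_ms=1.2):
    # Expected accepted tokens from a depth-`budget` chain with per-token
    # acceptance probability `accept_rate` (geometric-style approximation).
    expected_accepted = sum(accept_rate**d for d in range(1, budget + 1))
    step_ms = step_base_ms + per_node_ms * budget
    return (expected_accepted + 1) * 1000.0 / step_ms  # tok/s

best = max(range(1, 65), key=throughput)
print("toy optimum budget:", best)
```

Past the optimum, the acceptance gain saturates while the per-step cost keeps growing linearly, so throughput falls. Cheaper steps (higher bandwidth, as on GDDR6) or a flatter acceptance curve shift the optimum down; that is the qualitative story behind 22 on gfx1151 vs 8 on gfx1100.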

Reproduce

```bash
# 1. Build PR #119 for gfx1151
git clone https://github.com/Luce-Org/lucebox-hub.git
cd lucebox-hub
git fetch origin pull/119/head:pr119 && git checkout pr119
git submodule update --init --recursive
cd dflash
cmake -B build -S . \
  -DCMAKE_BUILD_TYPE=Release \
  -DDFLASH27B_GPU_BACKEND=hip \
  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \
  -DDFLASH27B_HIP_SM80_EQUIV=ON
cmake --build build --target test_dflash -j

# 2. Models: Qwen3.6-27B target + Lucebox Q8_0 DFlash drafter
mkdir -p models/draft
hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models/
hf download Lucebox/Qwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models/draft/

# 3. Bench (DFlash decode + PFlash long-context prefill)
LD_LIBRARY_PATH=/opt/rocm/lib:$LD_LIBRARY_PATH \
DFLASH_BIN=$PWD/build/test_dflash \
DFLASH_TARGET=$PWD/models/Qwen3.6-27B-Q4_K_M.gguf \
DFLASH_DRAFT=$PWD/models/draft/dflash-draft-3.6-q8_0.gguf \
DFLASH27B_DRAFT_SWA=2048 \
DFLASH27B_PREFILL_UBATCH=512 \
python3 scripts/bench_he.py --n-gen 128 --ddtree-budget 22
```

DFLASH27B_PREFILL_UBATCH=512 applies the PR #159 fix on top of PR #119. Once #159 merges, this is the daemon default.

What is still missing

  • BSA scoring kernel on HIP. The drafter compress-score path uses BSA (block-sparse attention) on CUDA. PR #119 disables it on HIP and falls back to ggml's flash_attn_ext, which the daemon's own warning flags as ~3.4x slower. A rocWMMA-native sparse-FA kernel closes the gap. After it lands, PFlash TTFT at 16K drops from 27.6 s to roughly 8 s. At 128K, projected 7-10x over llama.cpp AR.
  • Multi-row q4_K decode GEMV. RDNA-native multi-row pattern (R=4-8 output rows sharing activation register state) for the drafter forward, currently 30% of compress time at long context.
  • Phase 2 tile shape tuning for gfx1151. Current rocWMMA flashprefill tiles are tuned for gfx1100. Strix Halo has different LDS and VGPR characteristics.
  • 70B+ MoE targets. 128 GiB headroom is wasted on a 27B. Qwen3.5-122B-A10B and MiniMax-M2.7-REAP 139B-A10B both fit. DFlash math ports cleanly to MoE; big work is wiring the expert-routed forward into the spec verify loop.

Constraints

ROCm 7.2.2+, gfx1151 tuned (gfx1100 also supported with arch-aware defaults), greedy verify only, no Vulkan / Metal / multi-GPU on this path yet.

We're working hard on this but we know we need to improve on many things.

Feedback is more than welcome :)

submitted by /u/sandropuppo
