r/LocalLLaMA

Strange pp and tg numbers on RX 7900 XTX with ROCm and Vulkan, Qwen3.6-27B non-MTP and MTP


So I'm getting very unsatisfactory results from running this model locally.

| Item | Current |
|---|---|
| OS | Ubuntu 24.04.4 LTS |
| Linux kernel | 6.8.0-124-generic |
| GPU | RX 7900 XTX / gfx1100 |
| llama.cpp | b9630 / 8ed274ef4 |
| ROCm | 7.2.4 |
| AMD driver | 6.16.13 |
| Vulkan | API 1.4.330, Mesa 26.0.0-devel |

Raw Backend Benchmarks, No Speculative MTP

| Backend | Model file | Prompt test | Prompt tok/s | Decode test | Decode tok/s |
|---|---|---|---|---|---|
| ROCm | Normal 27B | pp32768 | 235.73 | tg128 | 31.14 |
| Vulkan | Normal 27B | pp32768 | 634.80 | tg128 | 13.32 |
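The two backends trade places here: Vulkan is faster at prompt processing, ROCm is faster at decode. A rough sketch of what that means end to end, using only the numbers in the table. It assumes throughput stays constant for the whole run, which is probably optimistic for decode since tg128 was not measured behind a 32k context. The function names are made up for illustration.

```python
# Rough end-to-end estimate from the benchmark numbers in the table above.
# Assumption: throughput is constant over the run. tg128 was measured without
# a 32k context in front of it, so real decode speed at depth is likely lower.

RATES = {
    # backend: (prompt tok/s at pp32768, decode tok/s at tg128)
    "ROCm": (235.73, 31.14),
    "Vulkan": (634.80, 13.32),
}


def total_seconds(backend: str, prompt_tokens: int, gen_tokens: int) -> float:
    """Estimated wall time = prefill time + decode time."""
    pp, tg = RATES[backend]
    return prompt_tokens / pp + gen_tokens / tg


def break_even_gen_tokens(prompt_tokens: int) -> float:
    """Reply length at which ROCm's faster decode offsets its slower prefill."""
    pp_r, tg_r = RATES["ROCm"]
    pp_v, tg_v = RATES["Vulkan"]
    prefill_gap = prompt_tokens / pp_r - prompt_tokens / pp_v  # seconds Vulkan saves
    per_token_gap = 1 / tg_v - 1 / tg_r  # seconds ROCm saves per generated token
    return prefill_gap / per_token_gap


if __name__ == "__main__":
    for name in RATES:
        print(f"{name}: {total_seconds(name, 32768, 128):.1f} s for pp32768 + tg128")
    print(f"break-even: ~{break_even_gen_tokens(32768):.0f} generated tokens")
```

Under those assumptions Vulkan finishes a 32k prompt plus 128 generated tokens in roughly 61 s against roughly 143 s for ROCm, and ROCm only pulls ahead once the reply runs to around two thousand tokens.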

Real API Test, ROCm Only, 32,201 Prompt Tokens + 128 Gen

| Config | Prompt tok/s | Gen tok/s | Wall | Draft acceptance |
|---|---|---|---|---|
| Normal 27B | 238.42 avg | 26.84 avg | 139.8s avg | N/A |
| MTP n=3 | 226.09 avg | 17.14 avg | 149.9s avg | 78.76% |
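A quick sanity check on where the wall time goes in the API test, as a sketch. It assumes wall time is roughly prompt tokens divided by prompt rate plus generated tokens divided by gen rate, and ignores any request overhead. The helper name `breakdown` is made up for illustration.

```python
# Split the measured wall time into prefill and generation using the averages
# from the API test above.
# Assumption: wall ~= prompt_tokens / prompt_rate + gen_tokens / gen_rate.

PROMPT_TOKENS = 32_201
GEN_TOKENS = 128

RUNS = {
    # config: (prompt tok/s, gen tok/s, measured wall seconds)
    "Normal 27B": (238.42, 26.84, 139.8),
    "MTP n=3": (226.09, 17.14, 149.9),
}


def breakdown(prompt_rate: float, gen_rate: float) -> tuple[float, float]:
    """Return (prefill seconds, generation seconds) for the test request."""
    prefill = PROMPT_TOKENS / prompt_rate
    gen = GEN_TOKENS / gen_rate
    return prefill, gen


if __name__ == "__main__":
    for name, (pp, tg, wall) in RUNS.items():
        prefill, gen = breakdown(pp, tg)
        share = prefill / (prefill + gen)
        print(
            f"{name}: prefill {prefill:.1f} s + gen {gen:.1f} s "
            f"= {prefill + gen:.1f} s (measured {wall} s), prefill share {share:.0%}"
        )
```

The estimate lands within a fraction of a second of the measured wall time for both configs, and prefill accounts for about 95% or more of it. For this request shape, prompt processing speed matters far more than the decode rate or the MTP setting.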

Basically it's working like shit. I also tried vLLM, but it's a dead end on my hardware.

MTP config:

    llama-server \
      --model /models/Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf \
      --host 0.0.0.0 \
      --port 8000 \
      --n-gpu-layers 99 \
      --ctx-size 65565 \
      --no-mmap \
      --flash-attn on \
      --spec-type draft-mtp \
      --spec-draft-n-max 3 \
      --ubatch-size 2048 \
      --parallel 1 \
      --cont-batching \
      --metrics

Normal config:

    llama-server \
      --model /models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
      --host 127.0.0.1 \
      --port 18080 \
      --n-gpu-layers 99 \
      --ctx-size 65565 \
      --no-mmap \
      --flash-attn on \
      --ubatch-size 2048 \
      --parallel 1 \
      --cont-batching \
      --metrics

Any ideas on how to improve this? Should I try updating the kernel? Idk, I've spent a few days tweaking and trying different combinations. This post is asking more about overall performance, not only the MTP speedup.

submitted by /u/Thin_Pollution8843