Strange pp and tg numbers on RX 7900 XTX with ROCm and Vulkan, Qwen3.6-27B non-MTP and MTP
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
So I'm getting very unsatisfactory results running this model locally.
| Item | Current |
|---|---|
| OS | Ubuntu 24.04.4 LTS |
| Linux kernel | 6.8.0-124-generic |
| GPU | RX 7900 XTX / gfx1100 |
| llama.cpp | b9630 / 8ed274ef4 |
| ROCm | 7.2.4 |
| AMD driver | 6.16.13 |
| Vulkan | API 1.4.330, Mesa 26.0.0-devel |
Raw Backend Benchmarks, No Speculative MTP
| Backend | Model file | Prompt test | Prompt tok/s | Decode test | Decode tok/s |
|---|---|---|---|---|---|
| ROCm | Normal 27B | pp32768 | 235.73 | tg128 | 31.14 |
| Vulkan | Normal 27B | pp32768 | 634.80 | tg128 | 13.32 |
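A rough way to read the two rows above: ROCm loses on prompt processing, Vulkan loses on decode, so which backend finishes first depends on how many tokens get generated. The sketch below assumes total time is simply prompt tokens divided by the pp rate plus generated tokens divided by the tg rate, using the table's rates as if they stayed constant. That is an assumption: decode speed measured as tg128 may not hold at a 32k context, and the function names are made up for illustration.

```python
# Rates copied from the raw backend benchmark table (tokens per second).
BENCH = {
    "ROCm": {"pp_tok_s": 235.73, "tg_tok_s": 31.14},
    "Vulkan": {"pp_tok_s": 634.80, "tg_tok_s": 13.32},
}


def estimate_seconds(backend, prompt_tokens, gen_tokens):
    """Estimated time = prefill time + decode time, assuming constant rates."""
    rates = BENCH[backend]
    return prompt_tokens / rates["pp_tok_s"] + gen_tokens / rates["tg_tok_s"]


def break_even_gen_tokens(prompt_tokens):
    """Generated-token count at which both backends take the same total time."""
    rocm, vulkan = BENCH["ROCm"], BENCH["Vulkan"]
    # Seconds ROCm falls behind during prompt processing.
    prefill_gap = prompt_tokens / rocm["pp_tok_s"] - prompt_tokens / vulkan["pp_tok_s"]
    # Seconds Vulkan falls behind for every generated token.
    per_token_gap = 1 / vulkan["tg_tok_s"] - 1 / rocm["tg_tok_s"]
    return prefill_gap / per_token_gap


if __name__ == "__main__":
    for backend in BENCH:
        total = estimate_seconds(backend, 32768, 128)
        print(f"{backend}: {total:.1f} s for pp32768 + tg128")
    print(f"break-even at about {break_even_gen_tokens(32768):.0f} generated tokens")
```

Under these assumptions Vulkan comes out well ahead for a long prompt with a short answer, and ROCm only catches up once the reply runs to a couple of thousand tokens.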
Real API Test, ROCm Only, 32,201 Prompt Tokens + 128 Gen
| Config | Prompt tok/s | Gen tok/s | Wall | Draft acceptance |
|---|---|---|---|---|
| Normal 27B | 238.42 avg | 26.84 avg | 139.8s avg | N/A |
| MTP n=3 | 226.09 avg | 17.14 avg | 149.9s avg | 78.76% |
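As a sanity check on the API numbers above, the wall time can be rebuilt from the reported rates: 32,201 prompt tokens at the prompt rate plus 128 generated tokens at the gen rate. The sketch below does only that arithmetic, assuming the wall time contains nothing else of significance. It shows how much of each run is prompt processing.

```python
PROMPT_TOKENS = 32201
GEN_TOKENS = 128

# Averages copied from the real API test table.
RUNS = {
    "Normal 27B": {"pp": 238.42, "tg": 26.84, "wall": 139.8},
    "MTP n=3": {"pp": 226.09, "tg": 17.14, "wall": 149.9},
}


def breakdown(run):
    """Return (prefill seconds, decode seconds, total seconds) for one run."""
    prefill = PROMPT_TOKENS / run["pp"]
    decode = GEN_TOKENS / run["tg"]
    return prefill, decode, prefill + decode


if __name__ == "__main__":
    for name, run in RUNS.items():
        prefill, decode, total = breakdown(run)
        share = 100 * prefill / total
        print(f"{name}: prefill {prefill:.1f} s + decode {decode:.1f} s "
              f"= {total:.1f} s (reported {run['wall']} s, prefill {share:.0f}%)")
```

The reconstructed totals land within a fraction of a second of the reported wall times, and prompt processing accounts for well over 90% of both runs. For this workload the prompt rate matters far more than the decode rate.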
Basically it's working like shit. I also tried vLLM, but it's a dead end on my hardware.
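One way to quantify how odd the MTP row is: at 78.76% acceptance with 3 draft tokens, each verification step should yield several tokens, yet generation got slower. The sketch below estimates the implied cost of one speculative step relative to a plain decode step. It rests on two assumptions that the post does not confirm: that "draft acceptance" means accepted draft tokens divided by drafted tokens, and that all 3 draft tokens are proposed on every step. The function names are illustrative.

```python
def tokens_per_step(draft_n, acceptance):
    """Expected tokens emitted per verification step.

    Each step yields the accepted draft tokens plus one token from the
    main model (assumes draft_n tokens are drafted every step).
    """
    return 1 + draft_n * acceptance


def implied_step_cost(baseline_tok_s, spec_tok_s, draft_n, acceptance):
    """Time of one speculative step, in units of one normal decode step."""
    spec_step_seconds = tokens_per_step(draft_n, acceptance) / spec_tok_s
    normal_step_seconds = 1 / baseline_tok_s
    return spec_step_seconds / normal_step_seconds


if __name__ == "__main__":
    # Gen rates and acceptance copied from the real API test table.
    per_step = tokens_per_step(3, 0.7876)
    cost = implied_step_cost(26.84, 17.14, 3, 0.7876)
    print(f"expected tokens per step: {per_step:.2f}")
    print(f"implied cost of one speculative step: {cost:.1f}x a normal decode step")
```

If those assumptions hold, a draft-plus-verify step is costing roughly five times a normal decode step, which would point at the drafting or batched verification path rather than at the acceptance rate.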
MTP:

    llama-server \
      --model /models/Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf \
      --host 0.0.0.0 \
      --port 8000 \
      --n-gpu-layers 99 \
      --ctx-size 65565 \
      --no-mmap \
      --flash-attn on \
      --spec-type draft-mtp \
      --spec-draft-n-max 3 \
      --ubatch-size 2048 \
      --parallel 1 \
      --cont-batching \
      --metrics

Normal:

    llama-server \
      --model /models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
      --host 127.0.0.1 \
      --port 18080 \
      --n-gpu-layers 99 \
      --ctx-size 65565 \
      --no-mmap \
      --flash-attn on \
      --ubatch-size 2048 \
      --parallel 1 \
      --cont-batching \
      --metrics

Any ideas on how to improve that? Should I try updating the kernel? I don't know, I've spent a few days tweaking and trying different combinations. This post is asking about overall performance, not only the MTP speedup.