r/LocalLLaMA · June 14, 2026 · 1 min read

Strange pp and tg numbers on RX 7900 XTX with ROCm and Vulkan, Qwen3.6-27B non-MTP and MTP

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

So I'm getting very unsatisfactory results from running this model locally.

Item	Current
OS	Ubuntu 24.04.4 LTS
Linux kernel	`6.8.0-124-generic`
GPU	RX 7900 XTX / `gfx1100`
llama.cpp	`b9630` / `8ed274ef4`
ROCm	`7.2.4`
AMD driver	`6.16.13`
Vulkan	API `1.4.330`, Mesa `26.0.0-devel`

Raw Backend Benchmarks, No Speculative MTP

Backend	Model file	Prompt test	Prompt tok/s	Decode test	Decode tok/s
ROCm	Normal 27B	`pp32768`	`235.73`	`tg128`	`31.14`
Vulkan	Normal 27B	`pp32768`	`634.80`	`tg128`	`13.32`
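To put the two backends side by side, here is a rough back-of-the-envelope sketch using only the figures from the table above. It assumes total time is simply prompt time plus decode time, and that the `tg128` rate would hold after a 32k prompt, which it likely would not, since decode usually slows down as context grows. The function name and the workload sizes are illustrative choices, not something from the benchmark tool.

```python
# Reported raw benchmark figures from the table above (tokens per second).
RAW = {
    "ROCm": {"pp": 235.73, "tg": 31.14},
    "Vulkan": {"pp": 634.80, "tg": 13.32},
}


def estimated_seconds(pp_tps, tg_tps, prompt_tokens=32768, gen_tokens=128):
    """Naive estimate: prompt time plus decode time, ignoring all overhead."""
    return prompt_tokens / pp_tps + gen_tokens / tg_tps


# Estimated end-to-end time per backend for a 32768-token prompt + 128 new tokens.
estimates = {name: estimated_seconds(r["pp"], r["tg"]) for name, r in RAW.items()}

# How much faster each backend is at the phase it wins.
pp_ratio = RAW["Vulkan"]["pp"] / RAW["ROCm"]["pp"]
tg_ratio = RAW["ROCm"]["tg"] / RAW["Vulkan"]["tg"]

for name, secs in estimates.items():
    print(f"{name}: ~{secs:.1f} s for 32768 prompt + 128 generated tokens")
print(f"Vulkan prompt speedup over ROCm: {pp_ratio:.2f}x")
print(f"ROCm decode speedup over Vulkan: {tg_ratio:.2f}x")
```

Under these assumptions the estimate comes out to roughly 143 s for ROCm and 61 s for Vulkan, so for a long-prompt, short-answer workload the prompt speed dominates. For short prompts with long generations the picture would flip in favor of ROCm.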

Real API Test, ROCm Only, 32,201 Prompt Tokens + 128 Gen

Config	Prompt tok/s	Gen tok/s	Wall	Draft acceptance
Normal 27B	`238.42 avg`	`26.84 avg`	`139.8s avg`	N/A
MTP `n=3`	`226.09 avg`	`17.14 avg`	`149.9s avg`	`78.76%`
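As a sanity check on the table above, the reported wall times can be recomputed from the reported throughputs. This is a minimal sketch that assumes wall time is just prompt time plus generation time for 32,201 prompt tokens and 128 generated tokens; the variable and function names are made up for illustration.

```python
PROMPT_TOKENS = 32201
GEN_TOKENS = 128

# Averages reported in the table above (tok/s and seconds).
RUNS = {
    "Normal 27B": {"pp": 238.42, "tg": 26.84, "wall": 139.8},
    "MTP n=3": {"pp": 226.09, "tg": 17.14, "wall": 149.9},
}


def predicted_wall(pp_tps, tg_tps):
    """Prompt time plus generation time, ignoring any other overhead."""
    return PROMPT_TOKENS / pp_tps + GEN_TOKENS / tg_tps


predicted = {k: predicted_wall(v["pp"], v["tg"]) for k, v in RUNS.items()}

# Fraction of the predicted wall time spent on prompt processing.
prompt_share = {
    k: (PROMPT_TOKENS / v["pp"]) / predicted[k] for k, v in RUNS.items()
}

# Generation speed with MTP relative to the normal model.
mtp_decode_ratio = RUNS["MTP n=3"]["tg"] / RUNS["Normal 27B"]["tg"]

for k, v in RUNS.items():
    print(f"{k}: predicted {predicted[k]:.1f} s, reported {v['wall']} s, "
          f"prompt share {prompt_share[k]:.0%}")
print(f"MTP generation speed vs normal: {mtp_decode_ratio:.2f}x")
```

The predicted wall times match the reported ones to within a fraction of a second, and prompt processing accounts for about 95% or more of each run. On these numbers MTP generation runs at roughly 0.64x the normal speed despite the 78.76% draft acceptance, so speculative decoding is a net loss here.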

Basically it's working like shit. I also tried vLLM, but it's a dead end on my hardware.

    llama-server \
      --model /models/Qwen3.6-27B-MTP-UD-Q4_K_XL.gguf \
      --host 0.0.0.0 \
      --port 8000 \
      --n-gpu-layers 99 \
      --ctx-size 65565 \
      --no-mmap \
      --flash-attn on \
      --spec-type draft-mtp \
      --spec-draft-n-max 3 \
      --ubatch-size 2048 \
      --parallel 1 \
      --cont-batching \
      --metrics

    llama-server \
      --model /models/Qwen3.6-27B-UD-Q4_K_XL.gguf \
      --host 127.0.0.1 \
      --port 18080 \
      --n-gpu-layers 99 \
      --ctx-size 65565 \
      --no-mmap \
      --flash-attn on \
      --ubatch-size 2048 \
      --parallel 1 \
      --cont-batching \
      --metrics

Any ideas on how to improve this? Should I try updating the kernel? Idk, I spent a few days tweaking and trying different combinations. This post is asking more about overall performance, not only the MTP speedup.

submitted by /u/Thin_Pollution8843