AMD MI50 on Debian Testing is doing great and getting better.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Some of this probably applies to other cards too, but my benchmarks are on dual MI50 32GB cards because that is what I have, and I thought I would share them with the community. Install instructions are at the end. I'll put a dump of the full llama-benchy tables in a comment in case anyone wants them; they also include concurrency levels 3, 4, and 8 (edit: the comment won't post, maybe because my internet sucks; I'll try again later).
For those who don't know, llama.cpp is available in the Debian testing repo, along with an updated Vulkan and a bit of a mishmash of ROCm and HIP library versions that work great and still support the MI50 cards without anything tricky (at least nothing tricky for the end user; the package maintainer apparently handles the tricky work).
The llama.cpp apt package was recently updated to version 9413, so I decided to run some benchmarks (using llama-benchy, not llama-bench) to see what works best. I'm using unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL, with and without MTP, on the Vulkan and ROCm llama.cpp backends (exact commands at the end of this post).
llama-benchy results (PP = prompt processing, TG = token generation, both in tokens per second)
Concurrency 1
| Backend | PP t/s | TG t/s |
|---|---|---|
| Vulkan | 977.18 | 55.34 |
| Vulkan-MTP | 937.96 | 89.76 |
| ROCm | 795.28 | 67.02 |
| ROCm-MTP | 759.22 | 92.69 |
Concurrency 2
| Backend | PP t/s | TG t/s |
|---|---|---|
| Vulkan | 1229.27 | 85.23 |
| Vulkan-MTP | 939.42 | 84.19 |
| ROCm | 913.80 | 88.46 |
| ROCm-MTP | 946.35 | 115.00 |
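To put the MTP effect in relative terms, here is a small sketch that divides the MTP token-generation rate by the non-MTP rate for each backend. The numbers are copied from the two tables above; the `ratio` helper is just an illustrative name, not part of llama-benchy.

```shell
#!/bin/sh
# ratio A B: print A / B rounded to two decimals (awk does the floating point).
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# TG t/s with MTP divided by TG t/s without MTP, per backend and concurrency.
echo "Vulkan, concurrency 1: $(ratio 89.76 55.34)x"
echo "ROCm,   concurrency 1: $(ratio 92.69 67.02)x"
echo "Vulkan, concurrency 2: $(ratio 84.19 85.23)x"
echo "ROCm,   concurrency 2: $(ratio 115.00 88.46)x"
```

On these numbers MTP gives roughly 1.6x TG on Vulkan and 1.4x on ROCm at concurrency 1. At concurrency 2 the Vulkan-MTP run shows no TG gain (about 0.99x), while ROCm still gains about 1.3x.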
I do a lot of long-context work, so I prefer higher PP as long as it doesn't cost too much TG, and I almost always run at concurrency 1, occasionally 2. So for my use, I'm going to run Vulkan with MTP. I ran the ROCm backend, installed from apt, for a long time, so I know it's very stable and runs well, just FYI.
Before this update there was no MTP support, and I was getting PP 700 and TG 55 with ROCm; even that non-MTP setup is faster now (the old numbers are from llama-bench; I wasn't using llama-benchy before and don't care to downgrade to retest). I don't know whether that comes from updates to the ROCm libraries, to llama.cpp, or a little of both.
Also, before using ROCm and llama.cpp from apt, I was manually installing ROCm 6.3.3 from AMD and building llama.cpp from source. Switching to the apt packages gave identical performance at that time, so there was no loss of performance from moving to the much-easier-to-install apt packages.
Installing
    # add unstable and testing repos
    sudo sh -c 'echo "deb http://deb.debian.org/debian unstable main" > /etc/apt/sources.list.d/debian-unstable.list'
    sudo sh -c 'echo "deb http://deb.debian.org/debian testing main" > /etc/apt/sources.list.d/debian-testing.list'
    # Lower the priority of testing and unstable so they are only used when necessary (optional, to stay as close to Debian stable as you can)
    sudo sh -c 'printf "Package: *\nPin: release a=testing\nPin-Priority: 60\n\nPackage: *\nPin: release a=unstable\nPin-Priority: 50\n" > /etc/apt/preferences.d/50pinning'
    sudo apt update

Installing with Vulkan backend:
    sudo apt install -t testing llama.cpp libggml0-backend-vulkan mesa-vulkan-drivers
    sudo adduser _llama-server video
    sudo adduser _llama-server render

Installing with ROCm backend:
    sudo apt install -t unstable llama.cpp libggml0-backend-hip
    sudo adduser _llama-server video
    sudo adduser _llama-server render

Both of these installs pull in EVERYTHING you need. You don't need anything from AMD and you don't need to copy any files manually. The package even creates a systemd service for llama-server, which reads environment variables from /etc/default/llama-server. Model files get downloaded to /var/cache/llama-server/.
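If the server cannot see the GPUs after installing, one thing worth checking is whether the `adduser` steps took effect. On Debian the `video` and `render` groups typically gate access to the GPU device nodes. This is a minimal sketch; `has_group` and `check_gpu_groups` are hypothetical helper names, not something the package ships.

```shell
#!/bin/sh
# has_group GROUP "LIST": succeed if GROUP is a whole word in the space-separated LIST.
has_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_gpu_groups USER: report whether USER belongs to both video and render.
check_gpu_groups() {
    groups=$(id -nG "$1" 2>/dev/null) || { echo "no such user: $1"; return 1; }
    for g in video render; do
        if has_group "$g" "$groups"; then
            echo "$1 is in $g"
        else
            echo "$1 is NOT in $g"
        fi
    done
}
```

Running `check_gpu_groups _llama-server` after the install should report membership in both groups. A service that was already running needs a restart before new group membership applies.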
Here's my /etc/default/llama-server:
    #ROCR_VISIBLE_DEVICES=0,1
    GGML_VK_VISIBLE_DEVICES=1,2
    LLAMA_SET_ROWS=1
    LLAMA_ARG_WEBUI=false
    LLAMA_ARG_THREADS=10
    LLAMA_ARG_MODELS_MAX=6
    LLAMA_ARG_HOST=0.0.0.0
    LLAMA_ARG_PORT=8080
    LLAMA_ARG_MODELS_PRESET=/mnt/data1-llama/production_presets.ini

"ROCR_VISIBLE_DEVICES" is for the ROCm backend and "GGML_VK_VISIBLE_DEVICES" is for the Vulkan backend. The numbers may be different if you have additional GPUs. I just tried different numbers and used nvtop (sudo apt install nvtop) to see which cards were activated.
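As an alternative to trial and error with nvtop, the sketch below shows two checks. The first prints which device-selection lines in an environment file are active, since a leading `#` disables a line. The second asks llama-server itself which devices it sees, assuming the packaged build supports the `--list-devices` option. Both function names are hypothetical.

```shell
#!/bin/sh
# active_device_vars FILE: print the *_VISIBLE_DEVICES lines that are not commented out.
active_device_vars() {
    grep -E '^(ROCR_VISIBLE_DEVICES|GGML_VK_VISIBLE_DEVICES)=' "$1"
}

# list_llama_devices: have llama-server print the devices it can use, then exit.
# Guarded so it fails cleanly when llama-server is not installed.
list_llama_devices() {
    if command -v llama-server >/dev/null 2>&1; then
        llama-server --list-devices
    else
        echo "llama-server not found in PATH" >&2
        return 1
    fi
}
```

For example, `active_device_vars /etc/default/llama-server` on the file above would print only the `GGML_VK_VISIBLE_DEVICES=1,2` line. Calling `list_llama_devices` with different values of the environment variable exported shows how the visible set changes.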
Here are the commands I used to run llama-server for each of the benchmarks:
Vulkan:
    sudo -u _llama-server GGML_VK_VISIBLE_DEVICES=1,2 LLAMA_CACHE=/var/cache/llama-server LLAMA_SET_ROWS=1 \
      llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL \
      -sm layer --flash-attn on --host 0.0.0.0 --port 8080 --no-ui --threads 10 --fit on --jinja \
      -ctk q8_0 -ctv q8_0 --batch-size 4096 --ubatch-size 1024 \
      --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.10 \
      --ctx-size 262144

Vulkan with MTP:
    sudo -u _llama-server GGML_VK_VISIBLE_DEVICES=1,2 LLAMA_CACHE=/var/cache/llama-server LLAMA_SET_ROWS=1 \
      llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q6_K_XL \
      -sm layer --flash-attn on --host 0.0.0.0 --port 8080 --no-ui --threads 10 --fit on --jinja \
      -ctk q8_0 -ctv q8_0 --batch-size 4096 --ubatch-size 1024 \
      --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.10 \
      --ctx-size 262144 --spec-type draft-mtp --spec-draft-n-max 3

ROCm:
    sudo -u _llama-server ROCR_VISIBLE_DEVICES=0,1 LLAMA_CACHE=/var/cache/llama-server LLAMA_SET_ROWS=1 \
      llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL \
      -sm layer --flash-attn on --host 0.0.0.0 --port 8080 --no-ui --threads 10 --fit on --jinja \
      -ctk q8_0 -ctv q8_0 --batch-size 4096 --ubatch-size 1024 \
      --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.10 \
      --ctx-size 262144

ROCm with MTP:
    sudo -u _llama-server ROCR_VISIBLE_DEVICES=0,1 LLAMA_CACHE=/var/cache/llama-server LLAMA_SET_ROWS=1 \
      llama-server -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q6_K_XL \
      -sm layer --flash-attn on --host 0.0.0.0 --port 8080 --no-ui --threads 10 --fit on --jinja \
      -ctk q8_0 -ctv q8_0 --batch-size 4096 --ubatch-size 1024 \
      --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.10 \
      --ctx-size 262144 --spec-type draft-mtp --spec-draft-n-max 3
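The four commands differ only in the device-selection variable, the model repo, and the two speculative-decoding flags. As a sketch of that structure, the hypothetical `bench_cmd` helper below assembles each command line from the shared flags and prints it without running anything. All flag values are copied from the commands above; the helper itself is an illustration, not how the benchmarks were actually launched.

```shell
#!/bin/sh
# Flags shared by all four benchmark runs.
COMMON_ARGS="-sm layer --flash-attn on --host 0.0.0.0 --port 8080 --no-ui --threads 10 --fit on --jinja -ctk q8_0 -ctv q8_0 --batch-size 4096 --ubatch-size 1024 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00 --repeat-penalty 1.10 --ctx-size 262144"

# bench_cmd BACKEND MTP: print the llama-server command for one run.
#   BACKEND is "vulkan" or "rocm"; MTP is "yes" or "no".
bench_cmd() {
    case "$1" in
        vulkan) devices="GGML_VK_VISIBLE_DEVICES=1,2" ;;
        rocm)   devices="ROCR_VISIBLE_DEVICES=0,1" ;;
        *) echo "unknown backend: $1" >&2; return 1 ;;
    esac
    if [ "$2" = "yes" ]; then
        # The MTP runs use the MTP GGUF repo plus the draft flags.
        model="unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q6_K_XL"
        spec=" --spec-type draft-mtp --spec-draft-n-max 3"
    else
        model="unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q6_K_XL"
        spec=""
    fi
    echo "sudo -u _llama-server $devices LLAMA_CACHE=/var/cache/llama-server LLAMA_SET_ROWS=1 llama-server -hf $model $COMMON_ARGS$spec"
}

# Print the command for the Vulkan + MTP configuration.
bench_cmd vulkan yes
```

Printing rather than executing keeps the sketch safe to run anywhere. The device indices are the ones from this machine and may differ on yours.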