r/LocalLLaMA · June 21, 2026 · 2 min read

R9700 abysmal performance, getting desperate


I've been trying to get my 2x R9700 setup to work for the past two weeks. This has been such a time sink that I wish I had just gone with NVIDIA. At this point I'm close to selling the cards.

I need vLLM. This is a dedicated setup for multi-user serving. I've tried both https://github.com/kyuz0/amd-r9700-vllm-toolboxes and https://github.com/JoergR75/automated-amd-rocm-7.2.4-pytorch-docker-vllm-cdna-rdna-deployment. I've changed operating systems and installed various driver versions.

I haven't gotten ANY model working with tp=2. It always errors out with `RuntimeError: NCCL error: unhandled cuda error`.

So what about serving a model with a single card? I get 30 tps... with Qwen 0.6B. A 27B INT4 AWQ model runs at 5 tps (see screenshot). WTF?
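For a rough sense of how far off these numbers are, here is a back-of-the-envelope sketch. It assumes single-stream decode is memory-bandwidth bound, so every generated token streams roughly all the weights from VRAM once. The ~640 GB/s bandwidth figure and the 16-bit precision for the 0.6B model are assumptions, not values from my logs, and `decode_ceiling_tps` is just a hypothetical helper name.

```python
# Back-of-the-envelope decode ceiling: each generated token has to read
# roughly all model weights from VRAM once, so
#   tokens/s <= memory bandwidth / weight bytes
# ASSUMPTION: ~640 GB/s is an approximate spec-sheet bandwidth for the R9700.
# Substitute a measured value if you have one.
ASSUMED_BANDWIDTH_GBPS = 640.0


def decode_ceiling_tps(params_billion: float, bits_per_weight: float,
                       bandwidth_gbps: float = ASSUMED_BANDWIDTH_GBPS) -> float:
    """Upper bound on single-stream decode tokens/s.

    Ignores KV-cache traffic, activations, and kernel overhead, so real
    numbers will land below this even on a healthy setup.
    """
    weight_gb = params_billion * bits_per_weight / 8.0  # weights in GB
    return bandwidth_gbps / weight_gb


if __name__ == "__main__":
    cases = [
        # (label, params in billions, bits per weight, observed tps from the post)
        ("Qwen 0.6B (assumed 16-bit)", 0.6, 16, 30),
        ("27B INT4 AWQ", 27, 4, 5),
    ]
    for label, params, bits, observed in cases:
        ceiling = decode_ceiling_tps(params, bits)
        print(f"{label}: ceiling ~{ceiling:.0f} tps, observed {observed} tps "
              f"({observed / ceiling:.0%} of ceiling)")
```

Under these assumptions both models sit far below the bandwidth ceiling, and the tiny model is proportionally much further off. That would point at per-token overhead rather than raw memory throughput, though this is only an estimate.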

I've tweaked BIOS settings, toggled IOMMU on/off, etc. Here's my setup:

```
root@gsrnt:~# python3 test.py

🐧 Ubuntu: Ubuntu 24.04.4 LTS
🔢 Kernel: 6.8.0-124-generic

💻 Installed CPU: AMD Ryzen 5 5600 6-Core Processor
🗄️ Total System-Memory: 63 GB

✅ PyTorch version: 2.12.0+rocm7.2
🧪 ROCm version: 7.2.53211-97f5574fe2
✅ Is ROCm available: True
🤗 Transformers version: 5.12.1

⚡ Number of GPUs: 2

⚡ GPU 0 Name: AMD Radeon AI PRO R9700
💾 Free Memory : 0.00 GB
💾 Total Memory: 31.86 GB
🔌 PCI Device : 0000:06:00.0
🔌 PCIe Width : x16 (max x16)
🚀 PCIe Speed : 32.0 GT/s PCIe (max 32.0 GT/s PCIe)

⚡ GPU 1 Name: AMD Radeon AI PRO R9700
💾 Free Memory : 31.79 GB
💾 Total Memory: 31.86 GB
🔌 PCI Device : 0000:0a:00.0
🔌 PCIe Width : x16 (max x16)
🚀 PCIe Speed : 32.0 GT/s PCIe (max 32.0 GT/s PCIe)

✅ Tensor operation successful on GPU 0
Device: AMD Radeon AI PRO R9700
tensor([[0.8331, 1.1736, 1.7215],
        [1.2765, 1.2081, 1.5073],
        [1.1227, 0.7199, 0.8618]], device='cuda:0')

✅ Tensor operation successful on GPU 1
Device: AMD Radeon AI PRO R9700
tensor([[1.4947, 1.1025, 0.9573],
        [1.3334, 0.8177, 1.1294],
        [1.1068, 0.9787, 0.9126]], device='cuda:1')
```

The motherboard is a Gigabyte B550-EAGLE. I've run out of ideas on what else I can verify. If this were a botched motherboard or GPU, I assume tensor operations would not work at all. The first slot is x16, so I should get decent performance for inference alone.
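One more thing that might be worth verifying is the negotiated PCIe link at every hop, not just at the GPU itself. Both cards report x16 at 32.0 GT/s in the output above, which looks odd on a B550 board (a PCIe 4.0 platform, as far as I know). This is an assumption, but the GPU endpoint may be reporting an on-card link rather than what the slot negotiated. The sketch below walks up the standard Linux sysfs tree from each GPU and prints the link attributes of every parent bridge. `read_link` and `link_chain` are hypothetical helper names, and the PCI addresses are taken from the output above.

```python
from __future__ import annotations

from pathlib import Path

# Standard sysfs attributes exposed for PCIe devices on Linux.
LINK_FILES = ("current_link_speed", "current_link_width",
              "max_link_speed", "max_link_width")


def read_link(dev: Path) -> dict | None:
    """Return the link attributes of one PCI device directory, or None if any is missing."""
    if not all((dev / name).is_file() for name in LINK_FILES):
        return None
    return {name: (dev / name).read_text().strip() for name in LINK_FILES}


def link_chain(device_dir: Path) -> list[tuple[str, dict]]:
    """Walk from a PCI device up through its parent directories.

    Stops at the first ancestor without link attributes (for example the
    PCI root directory), returning (name, attributes) pairs, device first.
    """
    chain = []
    node = device_dir.resolve()  # follow the /sys/bus/pci/devices symlink
    while True:
        info = read_link(node)
        if info is None:
            break
        chain.append((node.name, info))
        node = node.parent
    return chain


if __name__ == "__main__":
    for addr in ("0000:06:00.0", "0000:0a:00.0"):  # GPU addresses from the post
        dev = Path("/sys/bus/pci/devices") / addr
        if not dev.exists():
            print(f"{addr}: not present on this machine")
            continue
        for name, info in link_chain(dev):
            print(f"{name}: x{info['current_link_width']} @ {info['current_link_speed']} "
                  f"(max x{info['max_link_width']} @ {info['max_link_speed']})")
        print()
```

If a bridge higher in the chain shows a narrower or slower link than the GPU does, that hop is the real limit for that card. On many B550 boards the second physical x16 slot is wired with fewer lanes, so the two cards may not show the same chain.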

I initially reported this over at https://github.com/kyuz0/amd-r9700-vllm-toolboxes/issues/13. I've since bumped the system RAM up to 64 GB and it's still just as bad. I've linked more debug logs and host info in the issue.

If someone could help me figure out what's going on here I'd be grateful.

submitted by /u/lmyslinski