R9700 abysmal performance, getting desperate
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I've been trying to get my 2x R9700 setup to work for the past two weeks. This has been such a time sink that I wish I had just gone with NVIDIA. At this point I'm close to selling the cards.

I need vLLM. This is a dedicated setup for multi-user serving. I've tried https://github.com/kyuz0/amd-r9700-vllm-toolboxes and https://github.com/JoergR75/automated-amd-rocm-7.2.4-pytorch-docker-vllm-cdna-rdna-deployment. I've changed operating systems and installed various driver versions. I didn't get ANY model working with

So what about serving a model with a single card? I get 30 tps...with a Qwen 0.6B. 27B INT4 AWQ runs at 5 tps (see screenshot). WTF? I've tweaked BIOS flags, IOMMU on/off, etc.

Here's my setup:

```
root@gsrnt:~# python3 test.py
🐧 Ubuntu: Ubuntu 24.04.4 LTS
🔢 Kernel: 6.8.0-124-generic
💻 Installed CPU: AMD Ryzen 5 5600 6-Core Processor
🗄️ Total System-Memory: 63 GB
✅ PyTorch version: 2.12.0+rocm7.2
🧪 ROCm version: 7.2.53211-97f5574fe2
✅ Is ROCm available: True
🤗 Transformers version: 5.12.1
⚡ Number of GPUs: 2
⚡ GPU 0 Name: AMD Radeon AI PRO R9700
💾 Free Memory : 0.00 GB
💾 Total Memory: 31.86 GB
🔌 PCI Device : 0000:06:00.0
🔌 PCIe Width : x16 (max x16)
🚀 PCIe Speed : 32.0 GT/s PCIe (max 32.0 GT/s PCIe)
⚡ GPU 1 Name: AMD Radeon AI PRO R9700
💾 Free Memory : 31.79 GB
💾 Total Memory: 31.86 GB
🔌 PCI Device : 0000:0a:00.0
🔌 PCIe Width : x16 (max x16)
🚀 PCIe Speed : 32.0 GT/s PCIe (max 32.0 GT/s PCIe)
✅ Tensor operation successful on GPU 0
Device: AMD Radeon AI PRO R9700
tensor([[0.8331, 1.1736, 1.7215],
        [1.2765, 1.2081, 1.5073],
        [1.1227, 0.7199, 0.8618]], device='cuda:0')
✅ Tensor operation successful on GPU 1
Device: AMD Radeon AI PRO R9700
tensor([[1.4947, 1.1025, 0.9573],
        [1.3334, 0.8177, 1.1294],
        [1.1068, 0.9787, 0.9126]], device='cuda:1')
```

The MB is a Gigabyte B550-EAGLE. I've run out of ideas on what else I can verify. If this were a botched motherboard / GPU, I assume tensor operations would not work at all.
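One thing that may be worth cross-checking in the output above: both cards report 32.0 GT/s, but the Ryzen 5 5600 and B550 are a PCIe 4.0 platform (16 GT/s per lane), so that figure possibly describes the link between the GPU and a bridge on the card rather than the slot link. Below is a minimal sketch, assuming a Linux host with the standard sysfs PCI attributes (`current_link_speed`, `current_link_width`, `max_link_speed`, `max_link_width`). It walks from a device up through its parent bridges and prints the negotiated link at each hop. The function name `link_chain` is illustrative, and the two PCI addresses are the ones from the pasted output.

```python
from pathlib import Path

LINK_ATTRS = ("current_link_speed", "current_link_width",
              "max_link_speed", "max_link_width")


def link_chain(device_dir):
    """Collect PCIe link attributes from a device up toward the root complex.

    device_dir is a sysfs device directory (or a symlink to one), for example
    /sys/bus/pci/devices/0000:06:00.0. Returns a list of dicts, device first.
    """
    node = Path(device_dir).resolve()  # follow the symlink into /sys/devices/...
    chain = []
    # Parent directories are the upstream bridges. Stop at the first directory
    # without link attributes (normally the host bridge, e.g. pci0000:00).
    while all((node / attr).is_file() for attr in LINK_ATTRS):
        entry = {"device": node.name}
        for attr in LINK_ATTRS:
            entry[attr] = (node / attr).read_text().strip()
        chain.append(entry)
        node = node.parent
    return chain


if __name__ == "__main__":
    # PCI addresses taken from the output above.
    for bdf in ("0000:06:00.0", "0000:0a:00.0"):
        print(f"GPU {bdf}")
        for hop in link_chain(f"/sys/bus/pci/devices/{bdf}"):
            print(f"  {hop['device']}: "
                  f"{hop['current_link_speed']} x{hop['current_link_width']} "
                  f"(max {hop['max_link_speed']} x{hop['max_link_width']})")
```

The entries closest to the root complex are the ones that should reflect the physical slot, so a narrower or slower link there than on the GPU itself would be a lead worth following.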
The first slot is x16, so I should get decent performance for inference only. I initially reported this over at https://github.com/kyuz0/amd-r9700-vllm-toolboxes/issues/13. I've since bumped the system RAM up to 64 GB and it's still just as bad. I've linked more debug logs and host info in the issue.

If someone could help me figure out what's going on here, I'd be grateful.
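To put the reported throughput in context, here is a rough sketch of the usual memory-bandwidth ceiling for single-stream decoding: each generated token has to read roughly all of the weights once, so tokens per second cannot exceed memory bandwidth divided by weight size. The 640 GB/s bandwidth figure for the R9700 and the 16-bit weights for the 0.6B model are assumptions, not values from this post. KV cache reads, activations, and kernel overhead are ignored, so real numbers will land below these ceilings.

```python
def decode_ceiling_tps(params_billion, bits_per_weight, mem_bandwidth_gbs):
    """Upper bound on single-stream decode tokens/s from memory bandwidth alone.

    Assumes every token reads all weights once and nothing else limits speed.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return mem_bandwidth_gbs * 1e9 / weight_bytes


if __name__ == "__main__":
    ASSUMED_BANDWIDTH_GBS = 640  # assumption: R9700 spec-sheet memory bandwidth

    # (label, parameters in billions, bits per weight, throughput reported in the post)
    cases = [
        ("27B INT4 AWQ", 27, 4, 5),
        ("Qwen 0.6B (assumed 16-bit)", 0.6, 16, 30),
    ]
    for label, params, bits, observed in cases:
        ceiling = decode_ceiling_tps(params, bits, ASSUMED_BANDWIDTH_GBS)
        print(f"{label}: ceiling ~{ceiling:.0f} tok/s, observed {observed} tok/s")
```

Under those assumptions the 27B INT4 ceiling works out to roughly 47 tokens/s, so 5 tps on a single card is about an order of magnitude below what memory bandwidth alone would allow, which points toward a software or configuration problem more than a raw hardware limit.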