Upgraded my budget build to multi-GPU for inference
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I added:
- 1x RTX 3090 - 610 USD
- 1x Arc A770 - 222 USD
- 1x PCIe x1 to 4x USB 3.0 PCIe riser
- New CPU cooler

Specs:
- Modified Zalman Z9 Plus Case
- 2x Zotac RTX 3090 24 GB
- 1x Intel Arc A770 16 GB
- 48 GB DDR4 RAM
- AMD Ryzen 5 1600X
- MSI X370 SLI Plus

All parts were purchased second hand except the RAM sticks (before the crisis) and the case. I bought the first RTX 3090 for 540 USD to build this server over a year ago.

Findings after 2 hours of testing:
- I thought the Vulkan backend would work well for multi-GPU inference and I could easily mix non-Nvidia GPUs. However, memory overhead is so much worse compared to CUDA.
- I can run Qwen 3.6 27b Q8_K_XL bf16 cache with 170k context using 2x3090 with CUDA at 30 tokens/s. Tensor split works very well. 3090s are power limited at 275 watts.
- There is an extra 5 GB memory overhead per 24 GB card while using Vulkan, which leaves very little space for context.
- I can run Qwen 3.6 27b Q8_K_XL q8_0 cache with 50k context using 2x3090 + A770 with Vulkan at 3 tokens/s. Yes, 3 tokens per second.
- The same model uses 16 GB VRAM with CUDA while it uses 21.7 GB with Vulkan before the KV cache is loaded in an RTX 3090.

Lessons learned:
- Vulkan is not good for a multi-GPU setup in llama.cpp.
- Stick to a single vendor (AMD/Intel/Nvidia) and use their own backend.
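To make the VRAM numbers above concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the 16 GB and 21.7 GB figures are per-card usage on one 24 GB RTX 3090 before the KV cache is allocated, and that nothing else is occupying VRAM. The variable and function names are made up for illustration and are not part of llama.cpp.

```python
# Per-card VRAM figures taken from the post (RTX 3090, before KV cache).
CARD_VRAM_GB = 24.0
USED_CUDA_GB = 16.0     # reported usage with the CUDA backend
USED_VULKAN_GB = 21.7   # reported usage with the Vulkan backend


def headroom_gb(card_vram_gb: float, used_before_kv_gb: float) -> float:
    """VRAM left on one card for KV cache and compute buffers (illustrative)."""
    return card_vram_gb - used_before_kv_gb


cuda_headroom = headroom_gb(CARD_VRAM_GB, USED_CUDA_GB)
vulkan_headroom = headroom_gb(CARD_VRAM_GB, USED_VULKAN_GB)
extra_vulkan_gb = USED_VULKAN_GB - USED_CUDA_GB

print(f"CUDA headroom per 3090:   {cuda_headroom:.1f} GB")
print(f"Vulkan headroom per 3090: {vulkan_headroom:.1f} GB")
print(f"Extra Vulkan usage:       {extra_vulkan_gb:.1f} GB")
```

Under these assumptions CUDA leaves about 8 GB per card for context while Vulkan leaves about 2.3 GB, which lines up with the much smaller context I could fit on Vulkan. The two runs also differ in cache type and GPU count, so this is only a rough comparison.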