r/LocalLLaMA

Upgraded my budget build to multi-GPU for inference

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I added:

1x RTX 3090 - 610 USD

1x Arc A770 - 222 USD

1x PCIe x1 to 4x USB 3.0 PCIe riser

New CPU cooler

Specs:

Modified Zalman Z9 Plus Case

2x Zotac RTX 3090 24 GB

1x Intel Arc A770 16 GB

48 GB DDR4 RAM

AMD Ryzen 5 1600X

MSI X370 SLI Plus

All parts were purchased second-hand except the RAM sticks (bought before the crisis) and the case. I bought the first RTX 3090 for 540 USD over a year ago to build this server.

Findings after 2 hours of testing:

I expected the Vulkan backend to work well for multi-GPU inference and to let me mix in non-Nvidia GPUs easily. However, its memory overhead is much worse than CUDA's. With CUDA on 2x3090, I can run Qwen 3.6 27b Q8_K_XL with a bf16 cache and 170k context at 30 tokens/s. Tensor split works very well. The 3090s are power limited to 275 watts.
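For anyone who wants to try a similar CUDA run, here is a rough Python sketch that only assembles the commands and does not execute them. The flags used (`-m`, `-c`, `-ngl`, `--tensor-split`, `--cache-type-k`, `--cache-type-v` for llama-server, and `-i`/`-pl` for nvidia-smi) exist in those tools, but the post does not give the actual command line, so everything else is an assumption: the model file name is hypothetical, the even 1,1 split is a guess, and "170k" is taken to mean 170000 tokens.

```python
import shlex

def build_llama_server_cmd(model_path, ctx_size, cache_type, tensor_split):
    """Assemble a llama-server argument list. Nothing is executed here."""
    return [
        "llama-server",
        "-m", model_path,                       # hypothetical file name, not from the post
        "-c", str(ctx_size),                    # context size in tokens
        "-ngl", "99",                           # offload all layers to the GPUs
        "--tensor-split", ",".join(str(x) for x in tensor_split),
        "--cache-type-k", cache_type,           # KV cache type for keys
        "--cache-type-v", cache_type,           # KV cache type for values
    ]

def build_power_limit_cmd(gpu_index, watts):
    """Assemble an nvidia-smi power limit command (needs root when actually run)."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]

# Assumed values: even split across two cards, 170k read as 170000 tokens.
server_cmd = build_llama_server_cmd("model-Q8_K_XL.gguf", 170_000, "bf16", [1, 1])
power_cmds = [build_power_limit_cmd(i, 275) for i in range(2)]

for cmd in power_cmds + [server_cmd]:
    print(shlex.join(cmd))
```

Printing the commands instead of running them keeps the sketch safe to paste. Adjust the split ratios and context size to whatever fits your cards.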

Vulkan adds roughly 5 GB of extra memory overhead per 24 GB card, which leaves very little space for context. With Vulkan on 2x3090 + A770, I can run Qwen 3.6 27b Q8_K_XL with a q8_0 cache and 50k context at 3 tokens/s. Yes, 3 tokens per second.

Before the KV cache is loaded, the same model occupies 16 GB of VRAM on an RTX 3090 with CUDA, versus 21.7 GB with Vulkan.
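To make the headroom difference concrete, here is a small Python sketch of the arithmetic using only the figures above. It assumes the full nominal 24 GB of a 3090 is usable, which ignores whatever the driver and desktop reserve, so treat the results as upper bounds.

```python
CARD_VRAM_GB = 24.0  # nominal RTX 3090 capacity (assumes all of it is usable)

# Per-card VRAM in use before the KV cache is loaded, as reported in the post.
USED_BEFORE_KV_GB = {"CUDA": 16.0, "Vulkan": 21.7}

def headroom_gb(backend):
    """VRAM left on one 24 GB card for KV cache and compute buffers."""
    return round(CARD_VRAM_GB - USED_BEFORE_KV_GB[backend], 1)

extra_vulkan_gb = round(USED_BEFORE_KV_GB["Vulkan"] - USED_BEFORE_KV_GB["CUDA"], 1)

for backend in USED_BEFORE_KV_GB:
    print(f"{backend}: {headroom_gb(backend)} GB left per card")
print(f"Extra Vulkan usage per card: {extra_vulkan_gb} GB")
```

By these numbers, CUDA leaves about 8 GB per card for context while Vulkan leaves about 2.3 GB.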

Lessons learned:

Vulkan is not good for a multi-GPU setup in llama.cpp. Stick to a single vendor (AMD/Intel/Nvidia) and use their own backend.

submitted by /u/whiteh4cker