r/LocalLLaMA

I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance

I’m posting this as a warning for anyone building multi-GPU local LLM rigs with older workstation/HEDT boards.

My setup (Node #04)

  • Gigabyte X399 Designare EX
  • Threadripper 1950X
  • 128GB DDR4
  • 4x RTX 3090
  • 10GbE TP-Link/Aquantia NIC
  • llama.cpp NCCL build
  • vLLM for safetensors models

I was getting weirdly disappointing multi-GPU results. The rig worked, all 4 GPUs were detected, VRAM was available, models loaded, but some workloads were underwhelming.

Example: Mistral Medium 3.5 128B Q4_K GGUF was only doing around 11 tok/s with low GPU usage, roughly 30%.

I assumed it was a backend/model/split/NCCL issue.

Turns out one of the 3090s was sitting in a physical x16 slot that is electrically PCIe 2.0 x4 on this board. Even worse, before fixing BIOS/settings/placement, Linux showed that GPU negotiating as low as Gen2 x1 / Gen1 x4.

The smoking gun:

    nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv

Bad layout showed one GPU effectively crippled.

After moving the cards around, the GPUs now show:

    GPU0: Gen3 max, x8
    GPU1: Gen3 max, x16
    GPU2: Gen3 max, x8
    GPU3: Gen3 max, x16

The hidden mistake was that the board has multiple physical x16-length slots, but they are not all electrically equal. The PCIe 2.0 x4 slot is where the NIC belongs, not a 3090.
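
To put rough numbers on why the electrical layout matters, here is a minimal sketch that computes theoretical one-direction link bandwidth from PCIe generation and lane count. The function name `pcie_bandwidth_gbs` is made up for illustration. The per-generation rates (2.5, 5, and 8 GT/s) and encodings (8b/10b for Gen1/Gen2, 128b/130b for Gen3) are the standard PCIe figures; real throughput will be lower because of protocol overhead.

```bash
# Approximate one-direction PCIe link bandwidth in GB/s (theoretical).
# Usage: pcie_bandwidth_gbs <generation 1-3> <lane count>
# Hypothetical helper, not part of any tool mentioned in the post.
pcie_bandwidth_gbs() {
    local gen="$1" lanes="$2"
    awk -v gen="$gen" -v lanes="$lanes" 'BEGIN {
        # Raw signalling rate (GT/s) and line-encoding efficiency per generation.
        if      (gen == 1) { rate = 2.5; eff = 8 / 10 }
        else if (gen == 2) { rate = 5.0; eff = 8 / 10 }
        else if (gen == 3) { rate = 8.0; eff = 128 / 130 }
        else { print "unsupported generation: " gen > "/dev/stderr"; exit 1 }
        # rate * eff is usable Gbit/s per lane; divide by 8 to get GB/s.
        printf "%.2f\n", rate * eff / 8 * lanes
    }'
}
```

For example, `pcie_bandwidth_gbs 2 4` prints 2.00 for the Gen2 x4 slot, while `pcie_bandwidth_gbs 3 8` prints 7.88 and `pcie_bandwidth_gbs 3 16` prints 15.75. On paper, even a Gen3 x8 slot has roughly four times the bandwidth of the Gen2 x4 one.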

After fixing the slot layout, results changed dramatically.

Qwen3.6 27B BF16 with vLLM TP=4 + MTP at 260K context:

    ~78-80 tok/s generation
    ~80% draft acceptance rate

Qwen3.6 27B BF16 GGUF with llama.cpp NCCL build, --split-mode tensor, MTP enabled:

    ~66.5 tok/s
    ~85% draft acceptance

Mistral Medium 3.5 128B Q4_K GGUF with llama.cpp:

Before, using --split-mode layer:

    ~11 tok/s
    low GPU utilization

After switching to the proper PCIe layout and using:

    --split-mode tensor --tensor-split 25,25,25,25

Result:

    ~24.7 tok/s

So the lessons:

  1. Do not trust physical slot length. Check electrical lane layout in the motherboard manual.
  2. Always verify real negotiated PCIe width/speed from Linux.
  3. nvidia-smi and lspci -vv are your friends.
  4. On llama.cpp, --split-mode layer can badly underuse GPUs for some large GGUF models.
  5. --split-mode tensor made a huge difference for my Mistral 128B GGUF test.
  6. If one GPU is accidentally on a bad PCIe path, the whole multi-GPU inference setup can look like a backend problem when it is actually a slot layout problem.

Useful commands:

    nvidia-smi topo -m

    nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
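
The query above can be turned into a quick pass/fail check. This is a sketch under stated assumptions: `flag_slow_links` is a hypothetical helper, and the default floor of Gen3 x8 is a guess that suits this board, not a universal rule. It compares against a fixed floor rather than the reported max, on the assumption that the max columns describe the GPU and slot together and so may not reveal a weak slot.

```bash
# Flag GPUs whose negotiated PCIe link is below a chosen floor.
# Reads rows on stdin in the column order of the query above:
#   index, name, bus id, current gen, current width, max gen, max width
# Expects no header line (use --format=csv,noheader).
# Hypothetical helper; min gen/width defaults (3 and 8) are assumptions.
flag_slow_links() {
    local min_gen="${1:-3}" min_width="${2:-8}"
    awk -F', *' -v g="$min_gen" -v w="$min_width" '
        # $4 = current generation, $5 = current width
        $4 + 0 < g || $5 + 0 < w {
            printf "GPU %s (%s): Gen%s x%s is below Gen%s x%s\n", $1, $3, $4, $5, g, w
            bad = 1
        }
        # Exit status is non-zero if any GPU was flagged.
        END { exit bad + 0 }
    '
}
```

Feed it the same query with `--format=csv,noheader`, piped into `flag_slow_links 3 8`. The current link generation can read lower while a GPU is idle because of power management, so it is worth running this while the cards are under load.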

    for B in 09:00.0 0a:00.0 41:00.0 42:00.0; do
      echo "===== $B ====="
      sudo lspci -vv -s "$B" | grep -E "LnkCap|LnkSta"
    done
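
The `lspci -vv` output can also be boiled down to one line per device. The sketch below assumes the usual `LnkCap:` / `LnkSta:` line layout ("Speed ...GT/s, Width xN"); `link_field` and `link_report` are hypothetical names. Note that LnkCap describes what the card itself supports, not the slot. A 3090 is a PCIe 4.0 device, so on a Gen3 board like this one it will always run below its LnkCap speed, and the width is the more telling number.

```bash
# Print "speed width" from the LnkCap or LnkSta line of lspci -vv output.
# Usage: lspci -vv -s <bus id> | link_field LnkSta
# Hypothetical helper; assumes the "Speed ...GT/s, Width xN" line format.
link_field() {
    local key="$1"
    # The trailing colon in the pattern skips the LnkCap2/LnkSta2 lines.
    sed -n "s/.*${key}:.*Speed \([0-9.]*GT\/s\)[^,]*, Width \(x[0-9]*\).*/\1 \2/p" | head -n 1
}

# Compare the negotiated link (LnkSta) with the device capability (LnkCap).
link_report() {
    local dump cap sta
    dump="$(cat)"
    cap="$(printf '%s\n' "$dump" | link_field LnkCap)"
    sta="$(printf '%s\n' "$dump" | link_field LnkSta)"
    if [ "$cap" = "$sta" ]; then
        echo "match: $sta"
    else
        echo "below capability: running $sta, device supports $cap"
    fi
}
```

Inside the loop above, this would replace the `grep` stage: `sudo lspci -vv -s "$B" | link_report`.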

If you are building a “cheap VRAM monster” with used 3090s, check this before blaming NCCL, llama.cpp, vLLM, quantization, or the model.

In my case, fixing PCIe slot placement turned the rig from “why is this so underwhelming?” into “okay, this thing is actually a monster.”

submitted by /u/BlackBeardAI