I accidentally crippled my 4x RTX 3090 LLM rig with a hidden PCIe 2.0 x4 slot and fixing it doubled Mistral 128B performance
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I’m posting this as a warning for anyone building multi-GPU local LLM rigs with older workstation/HEDT boards.
My setup (Node #04)
- Gigabyte X399 Designare EX
- Threadripper 1950X
- 128GB DDR4
- 4x RTX 3090
- 10GbE TP-Link/Aquantia NIC
- llama.cpp NCCL build
- vLLM for safetensors models
I was getting weirdly disappointing multi-GPU results. The rig worked, all 4 GPUs were detected, VRAM was available, models loaded, but some workloads were underwhelming.
Example: Mistral Medium 3.5 128B Q4_K GGUF was only doing around 11 tok/s with low GPU usage, roughly 30%.
I assumed it was a backend/model/split/NCCL issue.
Turns out one of the 3090s was sitting in a physical x16 slot that is electrically PCIe 2.0 x4 on this board. Even worse, before fixing BIOS/settings/placement, Linux showed that GPU negotiating as low as Gen2 x1 / Gen1 x4.
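For a sense of scale, the theoretical one-direction bandwidth of those link states can be worked out from the standard PCIe per-lane transfer rates and line encodings. This is only a rough sketch: `pcie_bandwidth` is a made-up helper name, and real throughput is lower once protocol overhead is included.

```bash
# pcie_bandwidth GEN WIDTH
# Prints the theoretical one-direction link bandwidth in GB/s.
# Hypothetical helper for illustration; covers PCIe Gen1 through Gen4 only.
pcie_bandwidth() {
  awk -v gen="$1" -v width="$2" 'BEGIN {
    # Per-lane transfer rate (GT/s) and line-encoding efficiency per generation.
    rate[1] = 2.5; eff[1] = 8 / 10      # 8b/10b encoding
    rate[2] = 5;   eff[2] = 8 / 10      # 8b/10b encoding
    rate[3] = 8;   eff[3] = 128 / 130   # 128b/130b encoding
    rate[4] = 16;  eff[4] = 128 / 130   # 128b/130b encoding
    if (!(gen in rate)) {
      print "unsupported generation: " gen > "/dev/stderr"
      exit 1
    }
    # GT/s * efficiency = usable Gbit/s per lane; divide by 8 for GB/s.
    printf "%.2f\n", rate[gen] * eff[gen] / 8 * width
  }'
}

pcie_bandwidth 2 4    # the electrically PCIe 2.0 x4 slot
pcie_bandwidth 3 16   # a full Gen3 x16 slot
```

By this estimate the Gen2 x4 slot tops out around 2 GB/s, against roughly 15.75 GB/s for Gen3 x16 and about 7.88 GB/s for Gen3 x8.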
The smoking gun:
```bash
nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
```
Bad layout showed one GPU effectively crippled.
After moving the cards around, the GPUs now show:
```text
GPU0: Gen3 max, x8
GPU1: Gen3 max, x16
GPU2: Gen3 max, x8
GPU3: Gen3 max, x16
```
The hidden mistake was that the board has multiple physical x16-length slots, but not all are electrically equal. The PCIe 2.0 x4 slot belongs to the NIC, not a 3090.
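A check like this can be scripted so a bad slot gets flagged instead of spotted by eye. The sketch below is one possible approach, not part of the original setup: `flag_slow_links` is a hypothetical helper, and the thresholds (Gen3, x8) are assumptions to adjust for your board. It compares against thresholds you pick rather than against the reported maximum, because the maximum may already reflect the slot's limits.

```bash
# flag_slow_links MIN_GEN MIN_WIDTH
# Reads CSV rows "index, name, bus_id, gen_current, width_current" on stdin
# and prints one line per GPU whose negotiated link is below either threshold.
# Returns non-zero if any GPU was flagged. Hypothetical helper for illustration.
flag_slow_links() {
  awk -F', *' -v g="$1" -v w="$2" '
    # Skip blank or short lines and a CSV header row, if one is present.
    NF < 5 || $1 == "index" { next }
    ($4 + 0) < (g + 0) || ($5 + 0) < (w + 0) {
      printf "GPU%s %s: Gen%s x%s (below Gen%s x%s)\n", $1, $3, $4, $5, g, w
      bad = 1
    }
    END { exit bad + 0 }
  '
}

# Intended usage on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current \
#     --format=csv,noheader | flag_slow_links 3 8
```

GPUs can drop to a lower link generation at idle to save power, so it is safer to run this while the cards are under load. The link width is the more reliable signal.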
After fixing the slot layout, results changed dramatically.
Qwen3.6 27B BF16 with vLLM TP=4 + MTP at 260K context:
```text
~78-80 tok/s generation
~80% draft acceptance rate
```
Qwen3.6 27B BF16 GGUF with llama.cpp NCCL build, --split-mode tensor, MTP enabled:
```text
~66.5 tok/s
~85% draft acceptance
```
Mistral Medium 3.5 128B Q4_K GGUF with llama.cpp:
Before, using --split-mode layer:
```text
~11 tok/s
low GPU utilization
```
After switching to the proper PCIe layout and using:
```bash
--split-mode tensor --tensor-split 25,25,25,25
```
Result:
```text
~24.7 tok/s
```
So the lessons:
- Do not trust physical slot length. Check electrical lane layout in the motherboard manual.
- Always verify real negotiated PCIe width/speed from Linux. `nvidia-smi` and `lspci -vv` are your friends.
- On llama.cpp, `--split-mode layer` can badly underuse GPUs for some large GGUF models. `--split-mode tensor` made a huge difference for my Mistral 128B GGUF test.
- If one GPU is accidentally on a bad PCIe path, the whole multi-GPU inference setup can look like a backend problem when it is actually a slot layout problem.
Useful commands:
```bash
nvidia-smi topo -m
```

```bash
nvidia-smi --query-gpu=index,name,pci.bus_id,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv
```

```bash
for B in 09:00.0 0a:00.0 41:00.0 42:00.0; do
  echo "===== $B ====="
  sudo lspci -vv -s "$B" | grep -E "LnkCap|LnkSta"
done
```
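The `LnkCap` line is what the device itself can do, and `LnkSta` is what was actually negotiated, so a card in an electrically narrower slot shows up as a mismatch between the two. As a rough sketch (the `summarize_link` helper is hypothetical, and it assumes the usual `Speed …GT/s, Width x…` wording that `lspci -vv` prints), the two lines can be condensed into one summary per device:

```bash
# summarize_link
# Reads `lspci -vv` text for one device on stdin and prints the device's
# link capability next to the negotiated link. Hypothetical helper for illustration.
summarize_link() {
  awk '
    # Map a per-lane transfer rate to its PCIe generation.
    function gen(speed) {
      if (speed == "2.5GT/s") return 1
      if (speed == "5GT/s")   return 2
      if (speed == "8GT/s")   return 3
      if (speed == "16GT/s")  return 4
      if (speed == "32GT/s")  return 5
      return "?"
    }
    # Return the token that follows `key` on the current line, minus a trailing comma.
    function field(key,    i, v) {
      for (i = 1; i < NF; i++)
        if ($i == key) { v = $(i + 1); sub(/,$/, "", v); return v }
      return "?"
    }
    /LnkCap:/ { cap_s = field("Speed"); cap_w = field("Width") }
    /LnkSta:/ {
      printf "capable Gen%s %s, negotiated Gen%s %s\n", gen(cap_s), cap_w, gen(field("Speed")), field("Width")
    }
  '
}

# Intended usage, with the bus IDs from the loop above:
#   sudo lspci -vv -s 09:00.0 | summarize_link
```

A 3090 that reports `capable Gen3 x16, negotiated Gen2 x4` is sitting on a slower path than the card supports. The negotiated speed can also drop while the GPU idles, so check it under load before drawing conclusions.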
If you are building a “cheap VRAM monster” with used 3090s, check this before blaming NCCL, llama.cpp, vLLM, quantization, or the model.
In my case, fixing PCIe slot placement turned the rig from “why is this so underwhelming?” into “okay, this thing is actually a monster.”