r/LocalLLaMA

Findings from troubleshooting P2P on a 4x 5060 Ti bifurcation setup

I spent the last week deep diving into this. For context, I’ve been using Linux for 14 years and am a cloud systems engineer with a focus on supported Linux infrastructure for a private cloud provider.

Essentially, if you are using a single 4x4 bifurcation PCIe x16 card in the x16 slot on your mobo with 4 GPUs connected to it, then regardless of PCIe generation, the card that does the bifurcation is the choke point for P2P communication. It acts as the PCIe bridge that connects the GPUs, and with TP=4 the fabric that connects the 4 cards through that PCIe bridge becomes saturated and yields worse performance than running with P2P off. The ways to deal with this would be to either:

  1. Don’t run P2P. It’s only a 10 to 15% gain, which may not justify the cost and effort of building a setup where P2P actually delivers that gain.
  2. Pick up a Chinese SlimSAS bifurcation bridge. Supposedly you might not hit the bottleneck with those. They run between 150 and 250.
  3. Buy a Gen 4 PCIe bridge from Cpayne for 1200. These devices are made specifically for this use case, but spending 1200 for a 10% performance gain probably isn’t worth it.
  4. Don’t use tensor parallelism; use pipeline parallelism. The downside is that in my benchmarks pipeline parallelism yielded worse performance at low concurrency than TP=4 with P2P off. PP=4 only yields better performance if you have enough concurrency that every GPU has something to work on and none of them are waiting on another GPU to finish its work.
  5. There are used PLX switches on eBay. With these you run the risk that they won’t support a multi-GPU setup with P2P, due to firmware restrictions that limit their use with non-storage devices.
  6. Have a motherboard and CPU combo that provides dedicated x16 lanes to both the primary and secondary x16 slots. You could then run 8i bifurcation on each slot with 2 GPUs apiece. But if that setup requires a retimer to get Gen 4 or Gen 5, you are talking 130+ for each of the two retimer bifurcation cards.
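
The concurrency point in option 4 can be illustrated with a toy model. This is a minimal sketch, not a benchmark: it assumes every pipeline stage costs the same and that each in-flight request occupies exactly one stage at a time. The function name `pipeline_utilization` is made up for this illustration.

```python
def pipeline_utilization(stages: int, in_flight: int) -> float:
    """Fraction of GPUs busy in an idealized pipeline.

    Assumptions (simplifications, not measurements): all stages take the
    same time, and each in-flight request sits on exactly one stage at a time.
    """
    if stages < 1 or in_flight < 0:
        raise ValueError("stages must be >= 1 and in_flight must be >= 0")
    # At most one request per stage can be processed at any moment.
    return min(in_flight, stages) / stages


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        busy = pipeline_utilization(4, n)
        print(f"PP=4, {n} concurrent request(s): {busy:.0%} of GPUs busy")
```

Under these assumptions a single request keeps only one of the four GPUs busy at a time, which is consistent with PP=4 losing at low concurrency and only paying off once at least four requests are in flight.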

If there is a solution to this that I didn’t list, please let me know and I’ll update this post.
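
For anyone who wants to check the P2P on versus off difference on their own hardware, here is a rough sketch of an A/B timing harness. It assumes the inference stack uses NCCL for GPU-to-GPU communication, in which case the `NCCL_P2P_DISABLE=1` environment variable turns P2P off. The helper names and the benchmark command are placeholders, not something from the setup described above.

```python
import os
import subprocess
import time
from typing import Dict, List, Optional


def env_for_p2p(enabled: bool, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with NCCL P2P switched on or off.

    Assumption: the workload communicates through NCCL.
    """
    env = dict(os.environ if base is None else base)
    if enabled:
        # Leaving the variable unset lets NCCL use P2P where it is available.
        env.pop("NCCL_P2P_DISABLE", None)
    else:
        # NCCL environment variable that disables P2P transport.
        env["NCCL_P2P_DISABLE"] = "1"
    return env


def time_run(cmd: List[str], p2p: bool) -> float:
    """Run one benchmark command and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env_for_p2p(p2p), check=True)
    return time.perf_counter() - start


# Hypothetical usage: point BENCH_CMD at your own TP=4 benchmark script.
# BENCH_CMD = ["python", "my_tp4_benchmark.py"]
# print("P2P on :", time_run(BENCH_CMD, p2p=True))
# print("P2P off:", time_run(BENCH_CMD, p2p=False))
```

Wall-clock time of a whole run is a coarse measure; if the benchmark reports tokens per second, comparing that figure between the two runs is more informative.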

submitted by /u/joorklee