If you had $150K for building a production-class local inference server to serve 300 people, what would you buy?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I know we usually focus on home lab stuff here for the most part, but I’m in a position where I’m trying to purchase a failover server for our production inference server for under $150K. Our main production server has 4 H100s, so I’m looking for something as close to that as possible, performance- and capacity-wise. Obviously H100s are reaching the end of their product cycle, so I figure there should be something newer that performs as well, if not better, hopefully at a reasonable price point. I understand that this is about the worst possible time to buy hardware, but unfortunately I can’t really afford to wait until the market gets better.
I’m looking for the best bang for the buck for inference right now. I thought about getting a DGX Station and using it for inference, but I can’t find them available for purchase anywhere yet. So my second thought was to get a SuperMicro rack server with something like 4 RTX Pro 6000s in it. Is that my best option for serving local models with vLLM to a few hundred people? In production we run 122B AWQ models at 256k context with a TP of 2 on vLLM, so I’m looking for something that can handle that and, preferably, more. We also run a small embedding model on the same server.
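For a rough sense of the sizing involved, here is a back-of-envelope sketch of how much VRAM a 122B AWQ model leaves for KV cache at TP=2. Everything beyond the 122B / AWQ / TP=2 figures is an assumption: AWQ is treated as a flat 4 bits per weight (quantization scales and zero points ignored), the H100s are assumed to be the 80 GB variant, the RTX Pro 6000 is taken at 96 GB, and 0.9 stands in for vLLM's `gpu_memory_utilization`. The function names are made up for illustration.

```python
GB = 1e9  # decimal gigabytes, good enough for a rough estimate


def awq_weight_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight footprint in GB (ignores AWQ scales/zero points)."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB


def kv_headroom_per_gpu_gb(gpu_mem_gb: float, weights_gb: float,
                           tp: int, mem_util: float = 0.9) -> float:
    """VRAM left per GPU for KV cache and activations after sharding weights.

    mem_util is an assumed fraction of each GPU that the server may claim.
    """
    return gpu_mem_gb * mem_util - weights_gb / tp


weights = awq_weight_gb(122)  # 122B parameters at ~4 bits each

# GPU memory sizes below are assumptions, not figures from the post.
for name, mem_gb in [("H100 (assumed 80 GB)", 80), ("RTX Pro 6000 (96 GB)", 96)]:
    headroom = kv_headroom_per_gpu_gb(mem_gb, weights, tp=2)
    print(f"{name}: ~{weights / 2:.1f} GB weights per GPU, "
          f"~{headroom:.1f} GB left for KV cache")
```

This only covers capacity. How many concurrent 256k-context requests fit in that headroom depends on the model's layer count, KV head count, and KV cache dtype, none of which are given here, and it says nothing about throughput differences from memory bandwidth or interconnect.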
I know $150K ain’t gonna go as far as it used to. What would you guys suggest in this situation?