Cost Analysis of my $6.4k Local LLM Server
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I haven't seen any of these done, so I wanted to share my experience in case it is useful to anyone. The purpose of this post is to show the total cost of ownership (TCO) of my local LLM server versus the API equivalent. Before you look at the final numbers, note that most people do not do proper financial accounting of hardware. Most people treat hardware as a fully depreciated cost, when in fact hardware typically depreciates slowly or, in some cases, appreciates over time. This significantly changes the TCO results and explains why the number at the bottom is better than what other people mention.
Hardware
First off, here are the shipped hardware prices:
- Used 4x MI100 32GB: $4234.82
- New ASRock EPYCD8-2T: $721.61
- New 1600W 80+ Plat PSU: $497.95
- Used 8x8GB DDR4 ECC RDIMMs: $348.79
- Used EPYC 7K62 48-core CPU: $254.28
- New CPU Cooler: $167.31
- New ATX Case: $132.43
- 4x SATA to USB power cables for blowers: $28.56
- 4x 75x30mm Blowers for GPUs: $13.76
- Plastic sheet for blower fab: $6.94
- Storage is a 1TB M.2 drive I had lying around: Free
Total Price: $6406.45
Configuration
The server is currently configured with four separate instances of llama.cpp running Qwen3.6 27B. It is running on Ubuntu with the latest ROCm. It has a low power profile on all components, and in its current workload it is able to process 20.4M input tokens and 1.32M output tokens per day. I do actually use all of this token capacity for a business process. The token output is lower than I expected, and I'll address that in the notes below.
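To put the daily volumes in more familiar units, here is a rough sketch that converts them to average tokens per second. It assumes the load is spread evenly over 24 hours and split evenly across the four instances, which the post does not state, so treat the results as blended averages and not as measured prompt-processing or generation speeds.

```python
# Daily token volumes reported in the post.
INPUT_TOKENS_PER_DAY = 20.4e6
OUTPUT_TOKENS_PER_DAY = 1.32e6
INSTANCES = 4  # four llama.cpp instances
SECONDS_PER_DAY = 24 * 60 * 60


def avg_tokens_per_second(tokens_per_day: float) -> float:
    """Average rate, assuming an even load over the whole day (an assumption)."""
    return tokens_per_day / SECONDS_PER_DAY


input_tps = avg_tokens_per_second(INPUT_TOKENS_PER_DAY)
output_tps = avg_tokens_per_second(OUTPUT_TOKENS_PER_DAY)

print(f"input:  {input_tps:.1f} tok/s total, {input_tps / INSTANCES:.1f} per instance")
print(f"output: {output_tps:.1f} tok/s total, {output_tps / INSTANCES:.1f} per instance")
```

Because input and output processing share the same wall-clock time, the real speed of each phase is higher than these averages.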
Equivalent API Cost
Qwen3.6 27B currently costs $0.29/M input tokens and $3.20/M output tokens on OpenRouter. This means the server's current daily processing is worth $5.92 of input and $4.22 of output, totalling $10.14 per day.
Expanding this to a year, the API equivalent is $3701.10. Per month, that's $308.43.
API Cost: $3701.10 per year
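The arithmetic above can be reproduced with a short sketch. It uses the prices and volumes quoted in the post, rounds to cents at each step the way the post appears to, and assumes a 365-day year; the helper name `usd` is just for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# Figures quoted in the post (prices per million tokens, volumes per day).
INPUT_PRICE_PER_M = Decimal("0.29")
OUTPUT_PRICE_PER_M = Decimal("3.20")
INPUT_M_PER_DAY = Decimal("20.4")
OUTPUT_M_PER_DAY = Decimal("1.32")
DAYS_PER_YEAR = 365  # assumption: no downtime, non-leap year


def usd(amount: Decimal) -> Decimal:
    """Round to whole cents, half up."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


input_cost_per_day = usd(INPUT_M_PER_DAY * INPUT_PRICE_PER_M)
output_cost_per_day = usd(OUTPUT_M_PER_DAY * OUTPUT_PRICE_PER_M)
api_cost_per_day = input_cost_per_day + output_cost_per_day
api_cost_per_year = usd(api_cost_per_day * DAYS_PER_YEAR)
api_cost_per_month = usd(api_cost_per_year / 12)

print(f"per day:   ${api_cost_per_day}")
print(f"per month: ${api_cost_per_month}")
print(f"per year:  ${api_cost_per_year}")
```

Swapping in your own token volumes and current provider prices gives the API-equivalent figure for a different workload.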
Equivalent in Coding Plans
I thought I'd throw this in here because it's hard to quantify otherwise and might be useful. I also use the Z.AI coding plan as an API provider for this same business process, so I can measure how many tokens the plan actually gives you and produce fairly comparable results. I have Z.AI's best plan, which is currently $144/mo, and it allows me about 4.5M input tokens and 200k output tokens of GLM 4.7 per day. GLM 4.7 is actually a less expensive model on OpenRouter than Qwen3.6 27B, believe it or not, and in many benchmarks they are comparable, so this is a fairer comparison than I'd have expected.
Normalizing this by input tokens (20.4M vs. 4.5M per day, about 4.5x), the same capacity would cost about $652.80 per month via this plan, or $7833.60 per year. That is more than double the cost of the same amount of GLM 4.7 use via OpenRouter, or the API cost of Qwen3.6 27B.
So, a word of caution: the coding plans aren't always a good value. Make sure you know what you're paying for. I actually paid much less for this plan when they were running specials at the start of the year, so it works out better for me, but I certainly won't renew my sub once the year expires.
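A minimal sketch of that normalization, assuming the plan price scales linearly with token allowance (which a real subscription would not necessarily do). The $652.80 figure corresponds to scaling by input tokens; scaling by output tokens instead gives a higher number, shown here for comparison.

```python
from decimal import Decimal, ROUND_HALF_UP

PLAN_PRICE_PER_MONTH = Decimal("144")
PLAN_INPUT_M_PER_DAY = Decimal("4.5")     # observed plan allowance
PLAN_OUTPUT_M_PER_DAY = Decimal("0.2")
SERVER_INPUT_M_PER_DAY = Decimal("20.4")  # local server workload
SERVER_OUTPUT_M_PER_DAY = Decimal("1.32")


def usd(amount: Decimal) -> Decimal:
    """Round to whole cents, half up."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def scaled_price(price: Decimal, needed: Decimal, provided: Decimal) -> Decimal:
    """Plan price scaled linearly to the needed volume (an assumption)."""
    return usd(price * needed / provided)


monthly_by_input = scaled_price(
    PLAN_PRICE_PER_MONTH, SERVER_INPUT_M_PER_DAY, PLAN_INPUT_M_PER_DAY
)
monthly_by_output = scaled_price(
    PLAN_PRICE_PER_MONTH, SERVER_OUTPUT_M_PER_DAY, PLAN_OUTPUT_M_PER_DAY
)
yearly_by_input = monthly_by_input * 12

print(f"scaled by input:  ${monthly_by_input}/mo, ${yearly_by_input}/yr")
print(f"scaled by output: ${monthly_by_output}/mo")
```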
Local LLM Costs
Electricity
I configured the server with low power profiles, so at full LLM load the whole server consumes 630 watts at the wall. This translates to 15.1 kWh per day, and at $0.14 per kWh, that is $2.11 per day to run. $0.14 is a worst case for me; my actual cost is more like $0.08 once off-peak hours and winter rates are included, but it's difficult to calculate an accurate estimate, so I chose to keep it very conservative.
Expanding that higher rate to a year, my local LLM server costs $770.15 in electricity.
Local LLM Cost: $770.15 per year or $64.18 per month
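Here is a small sketch of the electricity calculation, assuming a constant 630 W draw around the clock for 365 days. Without intermediate rounding it lands a couple of dollars above $770.15, because that figure multiplies the already-rounded $2.11 per day. The $0.08 rate is included to show the sensitivity to the electricity price.

```python
WATTS_AT_WALL = 630
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365
RATE_CONSERVATIVE = 0.14  # $/kWh, worst case from the post
RATE_TYPICAL = 0.08       # $/kWh, rough actual rate from the post


def kwh_per_day(watts: float) -> float:
    """Energy used per day at a constant draw (an assumption)."""
    return watts * HOURS_PER_DAY / 1000


def yearly_cost(watts: float, rate_per_kwh: float) -> float:
    """Yearly electricity cost with no intermediate rounding."""
    return kwh_per_day(watts) * rate_per_kwh * DAYS_PER_YEAR


daily_kwh = kwh_per_day(WATTS_AT_WALL)
yearly_conservative = yearly_cost(WATTS_AT_WALL, RATE_CONSERVATIVE)
yearly_typical = yearly_cost(WATTS_AT_WALL, RATE_TYPICAL)

print(f"{daily_kwh:.2f} kWh/day")
print(f"${yearly_conservative:.2f}/yr at ${RATE_CONSERVATIVE}/kWh")
print(f"${yearly_typical:.2f}/yr at ${RATE_TYPICAL}/kWh")
```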
Hardware Depreciation
Next, depreciation is an accounting term for how much something loses value over time. Cash accounting, which most people are familiar with, is not actually accurate here, because an asset you own still has value that can eventually be liquidated to recover part of its price. Depreciation shows you the cost of owning something over time in terms of how much you'd lose if you sold it at that time.
For the hardware, let's say all accessories fully depreciate (total loss), new components depreciate 50%, and used components depreciate 10%.
- Accessories: $349 * 100% = $349
- New components: $1219.56 * 50% = $609.78
- Used components: $4837.89 * 10% = $483.79
I think it's reasonable to say this depreciation will be roughly the same one day after purchase or 5 years after purchase. So basically this is a one-time cost that only slightly increases over time.
Local LLM Cost: $1442.57 1-time
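The depreciation buckets can be rebuilt from the hardware list with a sketch like the one below. The category assigned to each item is an inference: the post does not spell out the grouping, but treating the cooler, case, cables, blowers, and plastic sheet as accessories, and the motherboard and PSU as new components, reproduces the three subtotals exactly.

```python
from decimal import Decimal, ROUND_HALF_UP

# (item, category, shipped price). Categories are inferred, not stated in the post.
ITEMS = [
    ("4x MI100 32GB", "used", "4234.82"),
    ("ASRock EPYCD8-2T", "new", "721.61"),
    ("1600W PSU", "new", "497.95"),
    ("8x8GB DDR4 ECC RDIMMs", "used", "348.79"),
    ("Epyc 7k62 CPU", "used", "254.28"),
    ("CPU cooler", "accessory", "167.31"),
    ("ATX case", "accessory", "132.43"),
    ("SATA to USB power cables", "accessory", "28.56"),
    ("75x30mm blowers", "accessory", "13.76"),
    ("Plastic sheet", "accessory", "6.94"),
]

# Fraction of the purchase price assumed lost, per the post.
RATES = {"accessory": Decimal("1.00"), "new": Decimal("0.50"), "used": Decimal("0.10")}


def usd(amount: Decimal) -> Decimal:
    """Round to whole cents, half up."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


subtotals = {category: Decimal("0") for category in RATES}
for _name, category, price in ITEMS:
    subtotals[category] += Decimal(price)

depreciation = {c: usd(subtotals[c] * RATES[c]) for c in RATES}
total_depreciation = sum(depreciation.values(), Decimal("0"))
residual_value = sum(subtotals.values(), Decimal("0")) - total_depreciation

for category in RATES:
    print(f"{category:9s} ${subtotals[category]} -> ${depreciation[category]}")
print(f"total depreciation: ${total_depreciation}, residual value: ${residual_value}")
```

Changing the values in `RATES` is a quick way to see how sensitive the result is to the resale assumptions, which are the least certain inputs in the whole analysis.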
Infrastructure
To give the server reliable power that isn't impacted by other devices in my house and that can withstand startup surge, I had a new dedicated electrical circuit run to a new 20 amp breaker. This cost $780 for a pro to do. It isn't strictly necessary, but I felt it was a good idea long term because the system is possibly capable of saturating a 15 amp circuit.
I already have a homelab with switch, router, and shelving, so this was free for me. I was able to keep power usage to a reasonable level so I don't need extra HVAC. System labor is free because I'm doing it and I enjoy working on computers.
Local LLM Cost: $780 1-time
Total Local LLM Cost & Savings
Adding all that up for my local LLM setup, the first year's costs come to $2992.72 ($770.15 electricity + $1442.57 depreciation + $780 circuit). Once again, that is cost, not cash outlay. API costs are $3701.10 per year, so this represents a first-year savings of $708.38. For subsequent years the operating cost of the local LLM server is $770.15, representing $2930.95 in savings, assuming API costs stay the same (they will not, but this is for illustration purposes).
- First year Local LLM Server cost: $2992.72
- Subsequent year Local LLM Server cost: $770.15
- API Cost: $3701.10
- First year savings: $708.38
- Subsequent year savings: $2930.95
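The summary above can be reproduced, and extended to multiple years, with a sketch like this one. It takes the post's yearly and one-time figures as given and assumes API prices, electricity rates, and workload all stay flat. The cash payback figure at the end is an extra illustration, not something from the post: it ignores resale value entirely and counts the full $6406.45 hardware purchase plus the $780 circuit as cash spent.

```python
from decimal import Decimal

API_PER_YEAR = Decimal("3701.10")
ELECTRICITY_PER_YEAR = Decimal("770.15")
DEPRECIATION_ONE_TIME = Decimal("1442.57")
CIRCUIT_ONE_TIME = Decimal("780.00")
HARDWARE_CASH = Decimal("6406.45")

ONE_TIME_COST = DEPRECIATION_ONE_TIME + CIRCUIT_ONE_TIME


def local_cost(years: int) -> Decimal:
    """Cumulative cost of the local server (depreciation basis)."""
    return ONE_TIME_COST + ELECTRICITY_PER_YEAR * years


def cumulative_savings(years: int) -> Decimal:
    """API spend avoided minus local cost, assuming flat prices."""
    return API_PER_YEAR * years - local_cost(years)


first_year_cost = local_cost(1)
first_year_savings = cumulative_savings(1)
subsequent_year_savings = API_PER_YEAR - ELECTRICITY_PER_YEAR

# Cash view (an added illustration): years until avoided API spend
# covers everything actually paid out up front.
cash_outlay = HARDWARE_CASH + CIRCUIT_ONE_TIME
cash_payback_years = cash_outlay / subsequent_year_savings

for years in (1, 2, 3, 5):
    print(f"{years} yr: cumulative savings ${cumulative_savings(years)}")
print(f"cash payback: about {cash_payback_years:.2f} years")
```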
Notes
I mentioned that token output is lower than I expected. I am running a low power profile on these cards, and benchmarking showed that they run at about 70% of their full-power speed. In other words, full power produces around 43% more tokens (1 / 0.70 ≈ 1.43). That is still under what I was expecting. I think it can generally be explained by the MI100 being a rare card that is poorly optimized for in all major LLM software. So even though the cards have pretty good raw specs, they are not delivering what I hoped for. I was hoping for around double the performance, as that's what my 7900 XTX delivers with similar raw specs.
The main reason I got MI100s was their ability to use a 3-way interlink bridge. Unfortunately there is next to no documentation out there about these bridges, and I couldn't get it to work with my motherboard after spending days on it, so I ultimately chose to return it. This was the largest disappointment with this system because the interlinks would have been a big edge with mid-size models. As far as I can tell, though, the bridge requires a very specific PCIe architecture that only a set of supported motherboards from their deployed systems provides.
If I were to do it over, I'd probably go with prosumer cards like the R9700 or a unified memory setup like a couple of DGX Sparks. I'd expect them to be easier to work with all around and to give me more options long term. I do have a Strix Halo laptop, and that type of device (including Sparks and Apple machines) is ultimately an excellent option, especially for mid-size models that will hit PCIe in a GPU setup. If you are planning on going with a mid-size model, I'd strongly recommend stacking those types of devices instead of going the way I did: they are quite fast once you take PCIe into account, and they also use very little power, which reduces your electricity bill meaningfully.
Hope this helped!