I think the right way to think about the economics here is either “I would pay $X/hr for this short-lived job” or “I want to compare with buying it” (a 3-yr committed use discount in our case; RIs / Savings Plans on AWS). Unless you are an ML research lab (Google Brain, FAIR, OpenAI, etc.) or an HPC-style site sharing these, you won’t get 100% utilization out of your “I just bought it” purchase. Worse, in ML land, accounting math about N-year depreciation is pretty bogus: if the A100 is 2.5x faster, you’d have been better off with a 1-yr CUD on GCP and refreshing, rather than buying Voltas last year.
One amusing thing that’s not obvious about “just buy a DGX” is that many people can’t even rack one of these. At 400 watts per A100, our 16x variant is 6.4 kW of GPUs. That’s before the rest of the system, etc., but there are (sadly) a lot of racks in the world that just can’t handle that.
Makes you wonder what kind of power costs you would incur running one at 100% utilisation. Even with the best prices, I would have thought you'd be looking at several thousand a year, and that's not even factoring in provisioning. A load like that means three-phase power, and then you have to balance out the phases. So many little details become more of an issue once you get to datacenter-level power usage: UPS load, capacity costs and planning, networking. So whilst the cost of these units is high, the other costs sure do add up fast.
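Rough back-of-the-envelope on that, taking the 6.4 kW GPU figure from the DGX comment above and an assumed electricity rate (your tariff and PUE overhead will vary):

    # Annual electricity cost for 16x A100 at 400 W each, run flat out.
    # The $/kWh rates are placeholders; plug in your own tariff.
    gpu_watts = 16 * 400                      # 6,400 W of GPUs alone
    hours_per_year = 24 * 365                 # 8,760 h
    kwh_per_year = gpu_watts / 1000 * hours_per_year   # ~56,000 kWh

    for rate in (0.10, 0.15):                 # assumed $/kWh
        print(f"${rate:.2f}/kWh -> ${kwh_per_year * rate:,.0f}/yr")
    # ~$5,600-$8,400/yr, before the rest of the system, cooling overhead, UPS, etc.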
I’d say keeping a spare system in case of failure is probably a bigger deal than the $10k/yr to house it :).
Obviously there are factors going into that -- I could live without paying the Tesla tax (didn't need to virtualize, didn't need the VRAM, did need the fp64), I bought used, I didn't have a problem keeping it fed, I didn't need to burst, etc. -- but my point is that for some GPU workloads the cloud GPUs are really expensive, and the break-even utilization is far south of 100%, more like 5%.
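For what it's worth, here's the shape of that break-even calculation as a quick sketch; the dollar figures are purely illustrative placeholders (a used card plus its share of a workstation, an on-demand rate for a roughly comparable cloud GPU), not my actual costs:

    # Sketch: break-even utilization for owning a GPU vs renting by the hour.
    # All prices below are illustrative placeholders, not real quotes.
    def breakeven_utilization(own_cost_total, cloud_rate_per_hr, service_life_years=3):
        """Fraction of wall-clock time the card must be busy before renting
        the same hours on-demand would have cost more than owning."""
        hours = service_life_years * 365 * 24
        return own_cost_total / (cloud_rate_per_hr * hours)

    # e.g. ~$3,000 all-in (used card + share of the box) vs ~$2.50/hr in the cloud
    print(f"{breakeven_utilization(3000, 2.50):.1%}")   # ~4.6%, same ballpark as above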
The “consumer” parts are certainly popular ML workstations, and rightly so.
It's really expensive, and I think I should lean into buying hardware at this point.
I want to build a high end GPU rig, but was wondering how easy the setup was. I've only built "consumer" systems before (2x 1080Ti). Is there any appreciable difference?
Do you have a single card? Multiple? What motherboard do you use?
Do you have any takeaways or resources you can share?
I'm not using this for machine learning, so you might want to talk to someone who is before pulling the trigger. In particular, my need for fp64 made the choice of Titan V completely trivial, whereas in ML you might have to engage brain cells to pick a card or make a wait/buy determination.
We're in Colovore, which has fantastic power density and is running roughly 1,000 DGXs in their datacenter. It really wasn't all that difficult to get up and running. For us it made total sense, but we utilize our physical hardware completely and still have to scale into the cloud fairly regularly.
Did you move all your other gear into Colovore? (That’s one of the challenges, you often need to be close to the rest of your systems / data)
We also run at 100% utilization 24/7 and frequently use the cloud for burst before we go out and buy more cards/servers.
Colovore was super easy to get up and running and we will save at a minimum hundreds of thousands of dollars with this setup over exclusively using cloud instances.
Oh, so cloud providers are just giving away free resources? How generous!
Seriously, let's do the math:
If a V100 card cost $9000 new and you bought it a year ago, you could still sell it today for over $3000, so the year of ownership netted out to under $6000. On AWS, a comparable instance on a 1-yr commitment costs more than $1400/month, for a total of over $16000 for the year. You don't even need 50% utilization to break even. It doesn't matter how fast the A100 is.
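Plugging those numbers in (net of the resale value, versus renting the hours you'd actually use at the quoted committed rate):

    # Quick check of the buy-vs-rent claim above, using the figures as stated.
    purchase, resale = 9_000, 3_000          # V100 bought a year ago, sold today
    net_ownership_cost = purchase - resale   # $6,000 for the year (ignoring power/hosting)

    rent_per_month = 1_400                   # quoted 1-yr committed price
    rent_full_year = 12 * rent_per_month     # $16,800

    breakeven = net_ownership_cost / rent_full_year
    print(f"{breakeven:.0%}")                # ~36%: well under 50% utilization to break even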
But also, you can rent a single slice of one for tens of seconds and then walk away. I personally do about 20-ish GPU-hours a month while dabbling (and only because I’m lazy and leave my GPUs attached while I’m working, even if it’s just debugging the Python bits). For a V100 that’s about $50/month, which is in the noise compared to dealing with owning infrastructure (and would be even less if I could get by with a T4).
Still comes out cheaper than cloud offerings.
Intel Xeon Gold 5120, 14 cores, £1,200 used.
Renting 1 core: £1.20/mo.
It would take roughly 6 years of rent to pay for that single core's share of the CPU alone (quick check below), and that's excluding the 512MB RAM, 10GB SSD and unlimited 400Mb/s bandwidth.
It might throw the comparison off if there were variation in performance at different times, but there isn't.
Cores scale with RAM: you cannot rent 32 of them with only 512MB.
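Quick check of the six-year figure above, under the assumption that the £1,200 CPU price is split evenly across its 14 cores:

    # Payback period for one core's share of a used Xeon Gold 5120 vs renting a core,
    # using the figures above and splitting the CPU price evenly per core.
    cpu_price_gbp, cores = 1_200, 14
    rent_per_core_month = 1.20                        # £/mo for one rented core

    price_per_core = cpu_price_gbp / cores            # ~£86
    months_to_payback = price_per_core / rent_per_core_month
    print(f"{months_to_payback:.0f} months (~{months_to_payback / 12:.1f} years)")
    # ~71 months, i.e. about 6 years, before even counting RAM/SSD/bandwidth.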
Hetzner dedicated cores come with 4 GB of memory per vCPU, for reference.
£1.20/mo for a dedicated core is wildly below market price. There's no way they'd be able to make a profit, so either they're going out of business tomorrow or it's not dedicated.
The same thing happens with these large instances: because they're so much bigger, you won't use them for an entire month. You'll use them to get results within hours or days.
I've also read that the A100 can deliver 4x the performance on some DL workloads.