Looking at clouds
They have a tendency to evaporate.
They have their place, I grant you, but then there's this article.
IBM cloud has experienced a significant Severity One outage – the rating Big Blue uses to denote the most serious incidents that make resources in its cloud unavailable to customers. The impact was indeed severe: IBM stated that users might not be able to access its catalogue of cloudy services or provision affected services …
Once upon a time, IBM was the benchmark for business stability and reliability.
Then it fired everyone with a clue because costs, since its own management didn't have a clue.
Now, we get this IBM that can't even manage its own cloud properly, not to mention its own internal mail upgrade.
Frankly, anybody using IBM Cloud deserves everything they get. Yes, Cloud is obviously difficult, but IBM killed every excuse it could possibly have with its endless layoffs of experience.
You reap what you sow.
When I moved my current org out of public cloud a decade ago, cost was the primary selling point to management (an ROI of 8 months, according to my manager at the time). But for me it wasn't only cost; it was less headache having complete control over the infrastructure (aside from power and internet). Our internet provider has a 100% SLA for their network, and we haven't suffered any power issues at our main data center since we moved in. A data center we used in Amsterdam had minor power issues a few years ago, but everything with dual power supplies stayed online, and we don't have anything hosted in Europe anymore. Despite the 100% SLA on the network there are of course caveats: things like DDoS-related outages aren't covered (fortunately those are super rare; we've never been attacked directly, though on rare occasions we've been collateral damage).
Random outages, performance problems, and general WTF moments seemed to be endless points of stress when using public cloud, in part because they are always messing with it. And those were just the "small scale" issues and anxieties that don't make news stories like this one. Things like that are hard to put a $ figure on. You can work around some of it by making your app super resilient, but as a recent cloud/devops survey reported here on El Reg, most companies have not done that (as I predicted back in 2010). It's not easy or cheap to do.
Maybe we have been lucky, I'm not sure, but I can probably count on one hand the number of hard production VM server failures we've had in the past 5-6 years (oldest production server probably from 2015, oldest non-production from 2013). Zero production storage failures (oldest array online since 2014), zero networking failures (oldest networking device online since 2011). Just so damn reliable; it has even exceeded my expectations, and by a big margin. When a VM server fails, almost every time the VMs are auto-restarted elsewhere faster than the alerts can even come in notifying that they're down. It's super rare to need manual intervention on anything during that kind of event, and all state is retained. No data lost (other than perhaps some in-flight transactions).
Of course, outside of a few folks this level of reliability is not well recognized, in my opinion (as is to be expected, I suppose). A decent part of the success, I think, is keeping things simple. The more complex you get, the more likely you are to hit bugs. If you have really good testing you can probably catch them early, but no place I have worked at in the past 20 years had really good testing.
I recall one quote from a QA director many years ago at my first "SaaS" company (before that was a term), who said in a meeting something along the lines of, "If I had to sign off on any build going to production, we'd never ship anything." Another quote from my time there came from the director of engineering during a big outage: "Guys, is there anything I can buy that would make this problem go away?" (The answer was no, and that answer didn't come from me; I was just following along.) Fun times.