Re: IaaS loves to blame the customer
I don't agree there at all. Good infrastructure management is good management: a properly designed facility is a good start, well-trained and knowledgeable staff matter, and so does having and following standards.
The Fisher Plaza facility in Seattle, as far as I recall, had issues at the time such as:
* Staff not replacing UPS batteries before they expired
* Not properly protecting the "Emergency Power Off" switch (in one power incident a customer pressed it to find out what would happen; after that all customers were required to take "EPO training")
* Poor design led to a fire in the power room years after I moved out, which caused ~40 hours of downtime and months of running on generator trucks parked outside. A couple of years later I saw a news report of a similar fire at a Terremark facility; in that case they had independent power rooms and there was zero impact to customers.
* I don't recall the causes of any other power outages there, if there were other unique causes.
Another facility I was hosted in, in Amsterdam, had an insufficient power design as well, plus poor network policies:
* The network team felt it was perfectly OK to do maintenance on the network, including at one point taking half of their network offline WITHOUT TELLING CUSTOMERS. They fixed that policy after I bitched enough. My normal carrier of choice is Internap, which has a 100% uptime SLA and has been excellent over the 13 years I've been a network customer. Internap was not an option in Amsterdam at the time, so we went with the facility's internet connection, which was wired into the local internet exchange.
* At one point they told customers they had to literally shut off the "A" power feeds to do something, then the following week they had to shut off the "B" power feeds to do the same thing to the other side. I don't recall what the work was, but obviously they couldn't do maintenance without taking power down (so I'm guessing no N+1 design). Neither event had any real impact on my end, though we did have a few devices with only one PSU (with no option on those models for a second), so we lost those; they had redundant peers, however, so things just failed over. In nearly 20 years of co-location, only that facility ever had to take power down for maintenance.
One company I was at moved into a building (this was 18 years ago) that was previously occupied by Microsoft. We were all super impressed to see the "UPS Room"; it wasn't a traditional UPS design from what I recall, just tons of batteries wired up in what I imagine was a safe way. They had a couple dozen racks on site. It wasn't until later that the company realized most or all of the batteries were dead, so when they had a power outage it all failed. None of that was my responsibility; all of my gear was at the co-location.
My first data center was an AT&T facility, in 2003. I do remember one power outage there, my first: I was walking out of the facility and was in the lobby when the lights went out. The on-site staff rushed from their offices to the data center floor and stopped to assure me the data center floor was not affected (and it wasn't). Power came back on a few minutes later; I don't recall whether the issue was local to the building or a wider outage.
My first server room was in 2000. I built it out with tons of UPS capacity and tons of cooling, and I was quite proud of the setup, about a dozen racks. Everything worked great until one Sunday morning I got a bunch of alerts from my UPSs saying power was out. Everything still worked fine, but about 30 seconds later I realized that while I had ~45 minutes of UPS capacity, I had no cooling, so I rushed to the office to do graceful shutdowns of things. Fortunately things never got too hot; I was able to be on site about 10 minutes after the power went out. There was nothing really mission critical there. It was a software development company, and the majority of the gear was dev systems, plus the local email server (we had one email server per office) and a few other things.
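The lesson I took from that morning is that a UPS-on-battery alert should kick off a shutdown timer, not a car ride. Tools like NUT's upsmon handle this natively; the sketch below is just a rough illustration of the idea, assuming Network UPS Tools is installed and a UPS is configured under the placeholder name "myups" (the name and thresholds are mine, not from any particular setup).

```python
#!/usr/bin/env python3
"""Rough sketch: poll a NUT-managed UPS and shut the host down after
sustained power loss (e.g. when there is no generator or cooling behind
the UPS). Assumes NUT's `upsc` tool and a UPS named "myups"; both the
name and the thresholds are placeholders."""
import subprocess
import time

UPS_NAME = "myups@localhost"   # hypothetical NUT UPS name
POLL_SECONDS = 15              # how often to check status
ON_BATTERY_LIMIT = 300         # shut down after 5 minutes on battery


def ups_status() -> str:
    """Return the ups.status value from `upsc`, e.g. 'OL' or 'OB DISCHRG'."""
    out = subprocess.run(
        ["upsc", UPS_NAME], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        if line.startswith("ups.status:"):
            return line.split(":", 1)[1].strip()
    return ""


def main() -> None:
    on_battery_since = None
    while True:
        status = ups_status()
        if "OB" in status.split():  # "OB" = on battery in NUT status flags
            if on_battery_since is None:
                on_battery_since = time.time()
                print("Power lost, UPS on battery")
            elif time.time() - on_battery_since > ON_BATTERY_LIMIT:
                print("On battery too long, starting graceful shutdown")
                subprocess.run(["systemctl", "poweroff"], check=False)
                return
        else:
            if on_battery_since is not None:
                print("Power restored")
            on_battery_since = None
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()
```

In practice you'd let upsmon (or apcupsd for APC gear) do this and just tune its timers, but the point stands: if the room has battery runtime and no cooling or generator, the shutdown decision should be automated.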
There are certainly other ways to have outages. I have been on the front lines of three primary storage array failures in the last 19 years, arrays which had no immediate backup, so all of the systems connected to them were down for hours to days of recovery. I have also been in countless application-related outages, the worst of which dates back 18 years: an unstable app stack down for 24+ hours with the developers not knowing how to fix it. At one point there we had Oracle fly on site to debug database performance issues too. I've caused my own share of outages over the years, though I probably have a 500:1 ratio of outages I've fixed or helped fix versus outages I caused.
My original post, in case it wasn't clear, was specific to facility availability and to a lesser extent network uplink availability.