Re: Unbelievable
I'm assuming this is one of Google's own cloud data centers and not a 3rd party colo they are using... but cloud data centers are generally built to a lower standard of reliability to save costs, and you get the redundancy by having systems in multiple zones/regions rather than from higher availability in a single facility. The article makes it sound like just a single zone went down in the affected region?
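That multi-zone model pushes the redundancy up to the software layer: the individual building doesn't have to be bulletproof if the service can shift to another zone. Here's a minimal sketch of what that looks like from the client side; the zone names and health-check endpoints are made up for illustration and aren't Google's actual API:

```python
import urllib.request
import urllib.error
from typing import Optional

# Hypothetical per-zone endpoints for the same service; in a real deployment
# these would usually sit behind a regional or global load balancer instead.
ZONE_ENDPOINTS = {
    "zone-a": "https://svc.zone-a.example.internal/healthz",
    "zone-b": "https://svc.zone-b.example.internal/healthz",
    "zone-c": "https://svc.zone-c.example.internal/healthz",
}

def pick_healthy_zone(timeout: float = 2.0) -> Optional[str]:
    """Return the first zone whose health endpoint answers, else None.

    If one zone's facility loses power entirely, traffic simply shifts to
    whichever zones are still up.
    """
    for zone, url in ZONE_ENDPOINTS.items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return zone
        except (urllib.error.URLError, OSError):
            continue  # zone unreachable, e.g. the whole facility is dark
    return None

if __name__ == "__main__":
    zone = pick_healthy_zone()
    print(f"routing traffic to: {zone}" if zone else "all zones down")
```

The trade-off is exactly the one in the article: if your workload only lives in one zone, you inherit whatever corners were cut in that one building.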
I'm not sure how a battery failure could avoid taking things down if there was no other form of power available at the time. Utility power was gone (perhaps the facility is connected to only a single source of power), and it seems likely they just have a single bank of UPSes that failed. The cause could be anything, including not replacing the batteries when they expired.
Better engineered facilities would fare much better, but they also cost quite a bit more. Too many people assume that just because it's cloud, everything from the ground up is designed and deployed as well, and as robustly, as more traditional facilities/systems.
The facility where I host my personal equipment is old, I think from the 90s: single power feed, single UPS system (as far as I know), no redundant power anywhere in the facility. They have had quite a few power outages over the past decade (more than my home has), though none in the last 3 years. It's good enough and pretty cheap, so not a big deal to me...
The facility where I've hosted the gear for the orgs I have worked for over the past decade-plus is, by contrast, far better: N+1 everything, tons of regular testing, and at least two power feeds to the facility. I can only recall one occasion where they went on generator power. I have seen several notifications about UPS failures here and there, but since everything is N+1, redundancy was never impacted. Never so much as a blip in the power feeds distributed to the racks themselves.
There is a facility in Europe where my previous org hosted stuff, and I hated that place so much. At one point they needed to make some change to their power systems, and to do that they had to take half of their power offline for several hours, then a few days later repeat the process on the other half. Obviously no N+1 there, as we did lose power on each of our circuit pairs during the maintenance. There was no real impact to us, though, since everything was redundant (we did lose a couple of single-power-supply devices during the outage, but other units took over automatically). So many problems with that facility and their staff & policies; I was so happy to move out.
Back in the mid 2000s I was hired at a company that had stuff hosted in a facility that suffered the most outages of any I've ever seen, and the only other full-facility power outages I've experienced, thanks to bad power system design. Causes included dead UPS batteries that were never replaced (the UPSes failed the moment utility power cut), and at one point a customer intentionally pressed the "emergency power off" button to kill power to the whole facility because they were curious. After that 2nd event, all customers had to go through "training" about that button and sign off on having received it. There were probably 3 power outages in less than a year at that facility; by the time the 3rd one hit I was ready to move out and just needed VP approval, which came fast after that outage. A few years later the facility suffered an electrical fire and was completely down for about 40 hours until they got generator trucks on site to power back up, and it took them probably 6 months to repair the damage to the power system. Bad design... though I recall that facility being highly touted as a great place to be during the dot-com era.
By contrast, I recall reading a news article around that same time about another facility that was designed properly: it had a similar electrical fire in its power system, and it had zero impact on customers. Part of their power system went down, but thanks to proper redundancy everything stayed online. I recall a comment that often a fire department would require a full shutdown to safely fight the fire, but they were able to demonstrate they had isolated that part of the system, so they were not required to power down.
On that note, I'd never host in a facility that uses flywheel UPSes, and/or one that doesn't have tech staff on site 24/7 to handle basic issues during a power outage (like the automatic switch to generators not working). Flywheel UPSes don't give enough time (usually less than 1 minute) for a human to respond. I'd like to see at least 10-15 minutes of battery runtime capacity (hopefully only needing less than a minute for the generators to start).
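To put rough numbers on the runtime question, here's a back-of-the-envelope ride-through calculation; the load, stored-energy, and efficiency figures are assumptions picked for illustration, not any real facility's specs:

```python
# Back-of-the-envelope UPS ride-through estimate.
# All figures below are illustrative assumptions, not real equipment specs.

def runtime_minutes(stored_energy_kwh: float, load_kw: float,
                    efficiency: float = 0.9, usable_fraction: float = 0.8) -> float:
    """Minutes of ride-through: usable stored energy divided by the load."""
    usable_kwh = stored_energy_kwh * usable_fraction * efficiency
    return usable_kwh / load_kw * 60

CRITICAL_LOAD_KW = 500  # hypothetical IT load carried by this UPS

# A flywheel stores comparatively little energy, so even a healthy one
# only bridges the gap while the generators start.
flywheel = runtime_minutes(stored_energy_kwh=3, load_kw=CRITICAL_LOAD_KW)

# A battery string sized for roughly 15 minutes leaves time for a human
# to react if the automatic transfer to generators fails.
battery = runtime_minutes(stored_energy_kwh=170, load_kw=CRITICAL_LOAD_KW)

print(f"flywheel ride-through: ~{flywheel:.1f} min")  # ~0.3 min
print(f"battery ride-through:  ~{battery:.1f} min")   # ~14.7 min
```

Even doubling the flywheel's stored energy barely moves the needle; that approach only works if the generator start is fully automatic and you trust it completely.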