Google Cloud’s so-called uninterruptible power supplies caused a six-hour interruption

Google has revealed that a recent six-hour outage at one of its cloudy regions was caused by uninterruptible power supplies not doing their job. The outage commenced on March 29th and caused “degraded service or unavailability” for over 20 Google Cloud services in the us-east5-c zone. Google’s US east zone is centered on …

  1. Paul Crawford Silver badge
    Facepalm

    You had just one job to do...

    ...and that is keeping power on. I wonder who the UPS vendor is?

    From bitter experience: we had some Dell-branded APC 5kVA UPSes and they were useless, often failing when tested, and failing hard - the internal bypass was broken too. Out of five in total, all were dead within two years of operation. APC is now owned by Schneider, so I wonder if that crap has appeared elsewhere?

  2. Tubz Silver badge

    APC has gone downhill badly under Schneider. We're currently using Eaton, and I actually had to tweak the reporting because it was telling me too much information that wasn't truly a BAU requirement.

    1. cyberdemon Silver badge
      Pint

      This.

      I bought a little Schneider UPS from scamazon, but so far it has had a lower reliability factor than my house's electricity supply (quite an achievement, given my over-sensitive RCD and its propensity to trip if I scorch a naan bread in the toaster). It is currently bypassed because it threw a wobbly for no apparent reason yesterday (continuous tone, light flashing, no way to shut it up without shutting everything down - it was only on about 1/3 load).

      Curious as to what counts as "Too much information" from your Eaton UPS though? Was it talking about its piles?

      1. Will Godfrey Silver badge
        Stop

        Were they swollen?

    2. Paul Crawford Silver badge

      We had Dell-branded Eaton units a while back - generally quite good models, but eventually they hit end-of-life failures after a decade or so of use and at least one battery change.

      Currently using Riello which have proven reliable so far (touch wood!) but the web interface sucks a bit.

  3. Will Godfrey Silver badge
    Facepalm

    Unbelievable

    A battery failure should never take anything else down!

    I can remember seeing a bit of military kit in the 1960s (at a government surplus place in Reading) that solved the problem quite simply. There were two identical lead/acid batteries, each with a series fuse and diode, so whichever had the highest voltage delivered the current (although I suspect the battery internal resistances meant they both delivered some current). Recharging/trickle balance was done with separate fuse and current-limiting resistor combos. These days I would have thought a UPS for something as important as a major data centre would do considerably better.
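
    Purely for illustration, here's a quick sketch of how that diode-OR arrangement shares the load - a crude resistive model with made-up numbers, not a measurement of the real kit:

    ```python
    # Crude model of two diode-OR'd batteries feeding a resistive load.
    # All figures are invented for illustration.

    def bus_voltage(batteries, r_load, v_diode=0.7, tol=1e-6):
        """Find the common bus voltage by bisection.

        batteries: list of (open_circuit_voltage, internal_resistance) pairs
        r_load:    load resistance in ohms
        v_diode:   forward drop of each series diode
        """
        def net_current(v_bus):
            # Current pushed onto the bus by each battery; the diode blocks reverse flow
            supplied = sum(max(0.0, (v - v_diode - v_bus) / r) for v, r in batteries)
            return supplied - v_bus / r_load  # positive means the bus voltage must rise

        lo, hi = 0.0, max(v for v, _ in batteries)
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if net_current(mid) > 0:
                lo = mid
            else:
                hi = mid
        return lo

    # Two nominally identical 12 V batteries, one slightly fuller than the other,
    # each with ~30 milliohm internal resistance, feeding a 1 ohm load.
    batts = [(12.6, 0.03), (12.3, 0.03)]
    v_bus = bus_voltage(batts, r_load=1.0)
    for n, (v, r) in enumerate(batts, 1):
        print(f"battery {n}: {max(0.0, (v - 0.7 - v_bus) / r):.2f} A")
    print(f"bus: {v_bus:.2f} V")
    ```

    Run that and the fuller battery supplies most of the ~11.6 A while the weaker one still contributes the best part of an amp - exactly the "both delivered some current" behaviour, and losing either battery leaves the load powered.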

    P.S. There was a guy who worked there who was very informative and helpful to us youngsters. Can't remember his name after all these years :(

    1. Natalie Gritpants Jr

      Re: Unbelievable

      All well and good till you discover the diodes are backwards

      1. Will Godfrey Silver badge
        Coat

        Re: Unbelievable

        Shirley, nobody would do that

    2. Nate Amsden

      Re: Unbelievable

      I'm assuming this is one of Google's own cloud data centers and not a 3rd-party colo they are using... but cloud data centers are generally built with less redundancy to save costs, and you get the resilience by having systems in multiple zones/regions rather than higher availability in a single facility. The article makes it sound like just a single zone went down in the affected region?

      I'm not sure how a battery failure would not take anything down if there was no other form of power at the time. Utility power was gone (the facility was perhaps connected to only a single source of power), and it seems likely they just have a single bank of UPSes that failed. The reason could be anything, including not replacing the batteries when they expired.

      Better-engineered facilities would fare much better, but they also cost quite a bit more. Too many people assume that just because it's cloud, everything from the ground up is designed and deployed to be as robust as more traditional facilities/systems.

      The facility where I host my personal equipment is old - I think from the 90s - with a single power feed, a single UPS system (as far as I know), and no redundant power anywhere in the facility. They have had quite a few power outages over the past decade (more than my home), though none in the last 3 years. It's good enough and pretty cheap, so not a big deal to me...

      The facility where I've hosted the gear for the orgs I've worked for over the past decade-plus is, by contrast, far better: N+1 everything, tons of regular testing, and at least two power feeds to the facility. I can only recall one occasion where they went onto generator power. I have seen several notifications about UPS failures here and there, but since everything is N+1, redundancy was never impacted - never so much as a blip in the power feeds distributed to the racks themselves.

      There is a facility in Europe where my previous org hosted stuff; I hated that place so much. At one point they needed to make some change to their power systems, and to do that they had to take half of their power offline for several hours, then repeated the process on the other half a few days later. Obviously no N+1 there, as we did lose power on each of our circuit pairs during the maintenance. There was no real impact to us, though, as everything was redundant (we did lose a couple of single-power-supply devices during the outage, but other units took over automatically). So many problems with that facility and their staff and policies; I was so happy to move out.

      Back in the mid-2000s I was hired at a company that had stuff hosted in a facility that suffered the most outages of any I've ever seen - the only full-facility power outages, too - down to bad power system design. Causes included dead UPS batteries that were never replaced, so the UPSes failed when the power cut, and a customer who at one point intentionally pressed the "emergency power off" button to kill power to the whole facility because they were curious. After that second event, all customers had to go through "training" about that button and sign off on having received it. There were probably three power outages in less than a year at that facility, and by the time the third one happened I was ready to move out - I just needed VP approval, which came fast when that third outage hit. The facility suffered an electrical fire a few years later and was completely down for about 40-ish hours until they got generator trucks on site to power back up; it took them probably six months to repair the damage to the power system. Bad design... though I recall that facility being highly touted as a great place to be during the dot-com era.

      By contrast, I recall reading a news article around that same time about another facility, one that was designed properly, which had a similar electrical fire in its power system with zero impact to customers. Part of the power system went down, but thanks to proper redundancy everything stayed online. I recall a comment saying that often a fire department would require a full shutdown to safely fight the fire, but they were able to demonstrate they had isolated that part of the system, so they were not required to power down.

      On that note, I'd never host in a facility that uses flywheel UPSes, or one that doesn't have tech staff on site 24/7 to handle basic issues during a power outage (like the automatic switch to generators not working). Flywheel UPSes don't give enough time (usually less than a minute) for a human to respond. I'd like to see at least 10-15 minutes of battery runtime capacity (hopefully only needing less than a minute for the generators to start).
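
      Rough numbers to show why that margin matters - every figure below is an assumption for illustration, not a vendor spec:

      ```python
      # Illustrative ride-through arithmetic for a UPS bus. All figures are assumptions.

      def ride_through_seconds(usable_kwh, load_kw):
          """Seconds of runtime from a given usable energy store at a constant load."""
          return usable_kwh * 3600.0 / load_kw

      load_kw = 500.0          # assumed critical load on the UPS bus
      flywheel_kwh = 3.0       # assumed usable energy in a flywheel module
      battery_kwh = 150.0      # assumed usable energy in a battery string

      # If the automatic transfer to generator fails, how long does a human need?
      generator_start_s = 30.0      # assumed normal start-and-transfer time
      human_response_s = 10 * 60.0  # assumed time for on-site staff to intervene
      needed_s = generator_start_s + human_response_s

      for name, kwh in (("flywheel", flywheel_kwh), ("battery", battery_kwh)):
          runtime = ride_through_seconds(kwh, load_kw)
          verdict = "covers" if runtime >= needed_s else "does NOT cover"
          print(f"{name}: {runtime:.0f} s of ride-through, {verdict} a manual generator start")
      ```

      With those assumptions the flywheel gives you about 20 seconds, while the battery string gives roughly 18 minutes - the difference between "the generators start automatically or you're dark" and someone having time to walk over and press the start button.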

      1. Anonymous Coward
        Anonymous Coward

        Re: Unbelievable

        "There is a facility that my previous org hosted stuff in Europe, I hated that place so much. At one point they needed to make some change to their power systems, and to do that they had to take half of their power offline for several hours, then a few days later did the same process on the other half. Obviously no N+1 there, as we did lose power on each of our circuit pairs during the maintenance."

        This sounds like one of the DCs hosting LINX (the London Internet Exchange). That DC had a power outage one time due to a faulty Automatic Transfer Switch (which switches power between the grid and the UPS/generators, so an "intentional" SPOF). As part of the aftermath of that partial DC outage, they discovered a design flaw in that brand of ATS, and as they used several of these in different parts (floors?) of the DC, several scheduled complete power outages had to be arranged in order to replace the other ATSes.

        "On that note I'd never host in a facility that uses flywheel UPS,"

        I don't believe I've ever actually seen a flywheel UPS; however, I'd not feel comfortable being close to one, due to the risks if it were to catastrophically fail (AFAIK large flywheels tend to be underground or buried/partially buried for safety reasons).

  4. Timop

    The UPS was literally uninterrupted during the power outage.

    1. IanRS

      Procurement error

      They bought the Unavailable Power Supply instead.

      1. David 132 Silver badge
        Happy

        Re: Procurement error

        No no, they’re “uninterruptible” as in, “I’m having a lie-down, don’t interrupt me”.

  5. An_Old_Dog Silver badge
    Joke

    Chicken-and-Egg

    Cloud-resident UPS-battery monitoring software?

    1. rg287 Silver badge
      Joke

      Re: Chicken-and-Egg

      Cloud-resident UPS-battery monitoring software?

      Using AI battery condition monitoring algorithms. All written to the blockchain for data integrity!

      1. Paul Crawford Silver badge
        Joke

        Re: Chicken-and-Egg

        And... that was the reason for the outage, folks: the AI monitoring system consumed more power than the UPS....

  6. An_Old_Dog Silver badge

    When was a UPS System *Scheduled* Test Last Conducted?

    See title.

    1. I could be a dog really Silver badge

      Re: When was a UPS System *Scheduled* Test Last Conducted?

      The problem is that you can run scheduled tests, but they usually don't test what actually happens when the mains supply fails. For example, I used to manage one which (amongst many failings) didn't actually transfer all the load to battery when testing - so it gave an unrealistic indication of battery condition.

      Going back much further, I had one which, if you pulled the plug out of the wall, would fall over and die - you had to plug in some other power-using equipment and cut the power upstream of both, so that the mains collapsed properly and triggered it to work.

      The only true way to test is to switch off the upstream power and see what happens, but for some reason, business people aren't keen on that idea!

  7. Tyraelos

    Probably APC…..

    I run a number of large APC UPSes throughout my home, and the main issue I've found with some of them is that there's no warning when the battery is low or has died. They are supposed to warn through their LCD screens, software, and failed self-tests, and they absolutely do not. The only way I find out I need to replace the battery is when the power dies: I hear a few beeps and down goes the UPS. Sounds like this is what happened with these units. Oddly, it seems the older units work better than the newer ones - it's truly peculiar.

    1. ChrisElvidge Silver badge

      Re: Probably APC…..Older units

      I had several APC units 20 or so years ago. Never a problem.

      Put in new batteries regularly (motorbike 6v lead acids) even though APC said the batteries were not consumer replaceable. Just had to recalibrate after battery changes.

      1. ecarlseen

        Re: Probably APC…..Older units

        APC was amazing until they were bought out by Schneider Electric, after which quality dropped straight into the toilet.

    2. Anonymous Coward
      Anonymous Coward

      Re: Probably APC…..

      My current APC is a good number of years old, and still lasts an hour and a half or so when running gateway, router, and (very small) server. Replaced the battery in it once. I think it didn't even require tools, just sliding the cover off the back and swapping batteries.

      The former UPS was a Cyberpower, where the first hint of a problem with it was when power finally died - and it quit in the same instant. Replacing the battery required a full disassembly. It eventually refused to provide even pass-through power, which is when it got replaced.

  8. dubious
    Mushroom

    sounds about right

    Over the years I've probably had more power outages caused by UPS failures than I have had UPSes saving the DC from a power feed failure.

    1. I could be a dog really Silver badge

      Re: sounds about right

      Back when I was in that line of work, I would try to arrange that, with dual-supply machines (e.g. dual-supply servers), one supply went via the UPS and the other didn't. That way, when the UPS crapped out, it didn't take everything down with it. The downside is that any battery test/runtime calibration would be inaccurate.

      Another thing I liked to do was use only one power outlet on the UPS, feeding a separate multi-way distribution board. That way, if the UPS craps out, you can pull its supply cable and outlet cable and plug them together to get stuff working. A lot easier than trying to find all the leads needed when you've got half a dozen bits of kit plugged into C13 sockets directly on the back of the UPS.

  9. Excused Boots Silver badge

    Can’t help wondering if this will be the subject of a ‘Who Me?’ article in a few months’ time?

  10. Anonymous Coward
    Anonymous Coward

    It's Not Just Power

    Worked at a place where the UPS was regularly tested. Then we had a burst water line - and came mighty close to losing everything.

    Why? A room full of servers can get quite toasty, and our HVAC system used water towers to regulate temperature. No water, no cooling, thermal overload.

    Do YOU have a water tank, big enough to last 12 hours? With a contract to refill it, in extended outages? I know a place that does (now).
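
    For a sense of scale, here's a back-of-the-envelope sizing - every number below is an assumption plucked out of the air, so plug in your own heat load:

    ```python
    # Rough cooling-tower make-up water estimate. Every figure is an assumption.

    heat_load_kw = 500.0            # assumed heat rejected through the towers (IT + losses)
    hours = 12.0                    # outage duration to ride out
    latent_heat_kj_per_kg = 2260.0  # approx. heat removed per kg of water evaporated
    blowdown_factor = 1.25          # assumed extra make-up for drift and blowdown

    energy_kj = heat_load_kw * hours * 3600.0
    evaporated_kg = energy_kj / latent_heat_kj_per_kg
    makeup_litres = evaporated_kg * blowdown_factor  # 1 kg of water is roughly 1 litre

    print(f"~{makeup_litres / 1000:.0f} m^3 of make-up water for {hours:.0f} h at {heat_load_kw:.0f} kW")
    ```

    With those guesses you're looking at roughly a dozen cubic metres just to ride out 12 hours at 500 kW - not a tank most sites happen to have lying around.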

    1. jake Silver badge

      Re: It's Not Just Power

      "Do YOU have a water tank, big enough to last 12 hours?"

      Sure. My GSHP has been running nearly non-stop for about a quarter century.

      Why the nearly? I had to swap out the battery a couple of times (it was lead-acid/PV, now LiFePO4/PV; the backup is a 35-year-old 4kW Generac, which is way overkill, but handy for extra power in an emergency).

    2. I could be a dog really Silver badge

      Re: It's Not Just Power

      And there's the classic - computers on UPS, cooling isn't. So the computers keep running, but very quickly have to shut down (or automatically power off) as the server room cooks.

      Over the years, I've had to explain to many clients that "no, that small cupboard under the stairs (or whatever they have in mind) isn't ideal for the servers - it's only nice and cool now because it doesn't have a fan heater running 24x7 in it". I reckon cooling has to be one of the most under-considered aspects of running IT systems.
