Vanishing power feeds, UPS batteries, failover fails... Cloudflare explains that two-day outage

Cloudflare has explained how it believes it suffered that earlier multi-day control plane and analytics outage. The executive summary is: a datacenter used by Cloudflare apparently switched with little or no warning from operating on two utility power sources, to one utility source and emergency generators, to just UPS …

  1. A Non e-mouse Silver badge

    Repeat after me: An untested backup is a worthless backup.

    1. theblackhand

      Would testing have helped here?

      My reading of the power situation was that IF the data centre had used its generators for powering just the data centre, the earth issue likely would not have happened.

      Only the data centre knew about the deal to sell excess power back to the grid and as that was unknown to Cloudflare it is very unlikely that it would have been a test condition.

      My take on the lesson here is that if you need things to be done reliably at cloud scale, you either have to be able to quickly scale horizontally across facilities (challenging as your interconnects either become the bottleneck on scalability or the cost of additional facilities becomes a significant factor in scaling) or you run the data centres yourself to allow these risks to be managed inline with your company's goals.

      Or you try to be transparent and hope the explanations are sufficient to satisfy customers and you keep enough systems up to get by.

      1. A Non e-mouse Silver badge

        I think you're looking at two different test suites: one to test the power failover in the possible failure scenarios, the other to test your systems in the event they lose power.

    2. Doctor Syntax Silver badge

      "Repeat after me: An untested backup is a worthless backup."

      Also repeat after me: Cloud is just somebody else's computer data centre and they control it.

  2. Mike007

    Is it Google who have a testing strategy whereby one of their senior managers has the authority to phone any data centre and say "press the red button"?

    1. Anonymous Coward
      Anonymous Coward

      At our place, every month the incoming power is cut to the data centre to check all the power fail overs work.

      1. Dimmer Bronze badge

        Lessons:

        Battery life is 1/2 what is expected and 1/4 what is rated.

        If the power blinks three times, the fourth time it is down for a good long time.

        Redundancy sometimes causes the problems.

        Always run your backup site as a hot site.

        Primary and secondary systems should be rotated, and never use the same control system for both.

        You are screwed without the right people.

        Management and bean counters are the primary cause of failed redundancy. See previous statement.

        An outage because of a blown generator engine will be understood by a customer, while an outage because of a bad sensor will not.

        I would appreciate you guys telling us about any hard-won lessons you have had.
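        The battery rule of thumb above can be turned into a quick planning calculation. A minimal sketch, assuming the poster's 1/2-of-expected and 1/4-of-rated rules; the runtime figures are illustrative only:

```python
# Rule of thumb from the lessons above: batteries deliver about 1/2
# what you expect and 1/4 what they are rated for. The figures used
# here are made up for illustration.

def planning_runtime(rated_min, expected_min):
    """Return the most pessimistic runtime to plan around."""
    return min(rated_min / 4, expected_min / 2)

# A UPS rated for 40 minutes that you expect to give 15:
print(planning_runtime(40, 15))  # plan around 7.5 minutes
```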

        1. Anonymous Coward
          Anonymous Coward

          Re: Lessons:

          I used to do some work in lighting & sound for a small venue. When we were looking at lighting kit or sound amplifiers, we'd always take the covers off and look at the ratings of the power transistors. Bitter experience told us that these needed to be rated at least double their anticipated maximum load to last. Kit that didn't have that headroom rarely survived a year. (In which case we'd just replace the transistors with "correctly" rated ones and the problems disappeared.)
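          The double-headroom rule above boils down to a one-line check. A minimal sketch; the wattages below are invented for illustration, not real parts:

```python
# The venue rule above: power parts should be rated at least 2x their
# anticipated maximum load. Wattages are illustrative, not real parts.

def has_headroom(anticipated_max_w, rated_w, headroom=2.0):
    """True if the part meets the 2x derating rule of thumb."""
    return rated_w >= headroom * anticipated_max_w

print(has_headroom(50, 75))    # 75 W part on a 50 W stage: too tight
print(has_headroom(50, 100))   # 100 W part: meets the 2x rule
```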

          1. David Hicklin Bronze badge

            Re: Lessons:

            > ratings of the power transistors

            Similar thing at home with some touch-sensitive bedside lamps: touch the base and they came on dim, touch again and they got brighter, then cycled to off after the fourth touch. Very nice.

            We had three almost-identical ones: one from John Lewis and two cheaper ones from ASDA. The cheaper ones failed the first time their bulb blew, as their triacs were rated right on the limit, whilst the JL one is still going years later as it has a much beefier (volts and amps rating) triac.

        2. A Non e-mouse Silver badge

          Re: Lessons:

          Redundancy sometimes causes the problems.

          Totally agree. You should only use HA if you really need it. I've wasted far too much time trying to coax HA kit back to health after it threw a wobble because it was in the wrong state (i.e. broken when it thought it was fine, or fine when it thought it was broken).

          And just like cryptography: don't try to invent your own HA magic sauce. You will miss so many failure modes. If you can't afford to do HA properly, don't do it at all.

      2. Jim Whitaker

        That's brave, at least the first time.

    2. Doctor Syntax Silver badge

      How many data centres is he authorised to call at any one time?

    3. Phil O'Sophical Silver badge

      It's worth looking up the Netflix Chaos Monkey, as an example of HA testing.

      I've lost count of the number of HA system designs which supposedly handle failures, yet the failover framework itself assumes that everything will work correctly when disaster strikes.
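    The Chaos Monkey idea mentioned above can be sketched in a few lines. This is a toy failure-injection drill, not Netflix's actual tool, and the instance names are invented:

```python
import random

# Toy Chaos Monkey-style drill (not Netflix's real tool): kill one
# instance from a pool and check the redundancy target still holds.

def chaos_drill(instances, min_healthy, victim=None):
    """Remove one instance (random unless given) and report survival."""
    victim = victim or random.choice(sorted(instances))
    survivors = set(instances) - {victim}
    return victim, survivors, len(survivors) >= min_healthy

pool = {"pdx-01", "pdx-02", "pdx-04"}   # invented names
victim, left, ok = chaos_drill(pool, min_healthy=2, victim="pdx-04")
print(f"killed {victim}: {len(left)} healthy, pass={ok}")
```

    The point, as the comment above notes, is that the drill exercises the failover path itself rather than assuming it works when disaster strikes.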

  3. Doctor Syntax Silver badge

    "We had never tested fully taking the entire PDX-04 facility offline."

    That sort of thing is scary; scary enough to duck.

  4. TaabuTheCat

    You missed a remarkable part of the post-mortem

    "the overnight staffing at the site did not include an experienced operations or electrical expert — the overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week."

    For what I assume is a Tier 4 DC hosting critical services? Flexential have some 'splaining to do.

    1. tip pc Silver badge
      Holmes

      Re: You missed a remarkable part of the post-mortem

      the overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week.

      My money is on the unaccompanied technician doing something he wasn't adequately trained to do.

      How long till the Who, Me? or On Call?

  5. pdh

    Correction

    "We had never tested fully taking the entire PDX-04 facility offline."

    Sure you did -- just a couple of days ago.

  6. John Klos

    Cloudflare want us to trust them, but...

    They want to recentralize the Internet around them.

    They want to host and say they don't host, so they don't have to handle abuse, by redefining the word "host".

    They want to host known spammers and scammers because "free speech".

    They want people all over the world to send their DNS queries to them via DoH.

    They want to marginalize most of the non-western world by having CAPTCHAs on every web site.

    And so on.

    They try to distract from their nefarious activities using tons of seemingly positive things, like cheerful participation on Hacker News and by offering free services (which do little more than begin the process of addiction and dependency).

    I'm glad they're dumb enough to have outage after outage showing how the Internet is worse for using Cloudflare, because if they worked perfectly, many people would never know.

  7. Korev Silver badge
    Pint

    "We had never tested fully taking the entire PDX-04 facility offline. As a result, we had missed the importance of some of these dependencies," wrote Prince, and we appreciate the honesty.

    Absolutely.

    Obviously the single dependency on PDX-04 wasn't good; but this gives you confidence that it'll probably be fixed in the near future...

    A pint for the Cloudflare people who probably need one -->

    1. Excused Boots

      Yes quite probably this particular dependency will be fixed, and maybe even one or two others which tumble out of the woodwork. Fine.

      Except what they then need to do is give it a month or so and then deliberately cause a complete shutdown and see what else doesn't work. Fix that, and rinse and repeat until they can drop a whole data centre and nobody really notices. Then, and only then, can they say they have a proper, fully redundant and (reasonably) disaster-proof system.

      Of course, the problem is that until you get to that final state, customers absolutely will be inconvenienced, your company gets bad press in organs such as El Reg, you lose money and take reputational damage, and the C-suite people get cold feet. So the full-up tests don't really happen.*

      * I suspect mostly; yes there probably are some companies prepared to do this and run the risk because it helps them (and ultimately their customers) in the long term.

  8. gitignore

    RCA

    I was taught years ago to do root cause analysis, to find the root cause of a failure.

    I don't think, in 20+ years in the IT industry and some in EE before that, that I have ever seen a _single_ root cause of an issue. It's almost always a confluence of several unlikely intersecting issues.

    1. usbac Silver badge

      Re: RCA

      Look up the Swiss Cheese model that is used in aviation safety.

      I have read many NTSB air crash reports. Almost every accident occurs due to a number of often small events all lining up just at the wrong time. In most cases, several of these items happening at the same time would not have led to the accident, but just the right combination happened on that day, and people lost their lives.

      1. collinsl Bronze badge

        Re: RCA

        Perfect example of this:

        Plane crashed due to a burned out bulb in the landing gear indicator lights - the crew became preoccupied with checking if the landing gear was actually down and locked so they didn't notice their descent rate had increased and the autopilot was off, leading to them crashing into a swamp in Florida.

        https://en.wikipedia.org/wiki/Eastern_Air_Lines_Flight_401

  9. ChoHag Silver badge

    Until you've restored the data, er, unplugged the data centre, you don't have a backup, er, disaster recovery plan.

  10. PRR Bronze badge

    > .. and the breakers were found to be faulty

    I was vicariously involved in another datacenter power collapse, and it too revealed faulty breakers.

    Are datacenter breakers stressed differently than home or factory breakers?

    "We" (me and the power company) are cautious about residential breakers because they start CHEAP and are neglected. But aren't $500-$5,000 breakers made of sterner stuff, and tested with malice?

    1. collinsl Bronze badge

      Depends - a rat bridged the contacts in a breaker somewhere I worked and it blew the whole thing up. The breaker as well as the rat.
