Repeat after me: An untested backup is a worthless backup.
Vanishing power feeds, UPS batteries, failover fails... Cloudflare explains that two-day outage
Cloudflare has explained how it believes it suffered that earlier multi-day control plane and analytics outage. The executive summary is: a datacenter used by Cloudflare apparently switched with little or no warning from operating on two utility power sources, to one utility source and emergency generators, to just UPS …
COMMENTS
-
-
Tuesday 7th November 2023 09:56 GMT theblackhand
Would testing have helped here?
My reading of the power situation is that IF the data centre had used its generators to power just the data centre, the earthing issue likely would not have happened.
Only the data centre knew about the deal to sell excess power back to the grid, and as that was unknown to Cloudflare it is very unlikely to have been a test condition.
My take on the lesson here is that if you need things done reliably at cloud scale, you either have to be able to scale horizontally across facilities quickly (challenging, as your interconnects become the bottleneck on scalability or the cost of additional facilities becomes a significant factor in scaling), or you run the data centres yourself so these risks can be managed in line with your company's goals.
Or you try to be transparent and hope the explanations are sufficient to satisfy customers and you keep enough systems up to get by.
-
-
-
-
Tuesday 7th November 2023 12:51 GMT Dimmer
Lessons:
Battery life is 1/2 what is expected and 1/4 what is rated (see the sketch at the end of this comment).
If the power blinks three times, the fourth time it's down for a good long time.
Redundancy sometimes causes the problems.
Always run your backup site as a hot site.
Primary and secondary systems should be rotated and never use the same control system for both.
You are screwed without the right people.
Management and bean counters are the primary cause of failed redundancy. See previous statement.
An outage because of a blown generator engine will be understood by a customer; an outage because of a bad sensor will not.
I would appreciate you guys telling us about any hard-won lessons you have had.
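For illustration, here's a minimal sketch of that battery rule of thumb; derated_runtime() and the 40-minute rated figure are made up, and real UPS runtime depends on load, battery age and temperature:

```python
# Minimal sketch of the battery rule of thumb above: plan on half of
# what you expect and a quarter of what the datasheet rates.
# The 40-minute rated figure is made up.

def derated_runtime(rated_minutes: float) -> dict:
    """Return pessimistic runtime estimates derived from a rated figure."""
    return {
        "rated": rated_minutes,         # what the datasheet claims
        "expected": rated_minutes / 2,  # what you probably expect in practice
        "plan for": rated_minutes / 4,  # what to actually plan around
    }

if __name__ == "__main__":
    for label, minutes in derated_runtime(rated_minutes=40).items():
        print(f"{label:>8}: {minutes:.0f} min")
```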
-
Tuesday 7th November 2023 17:43 GMT Anonymous Coward
Re: Lessons:
I used to do some work in lighting & sound for a small venue. When we were looking at lighting kit or sound amplifiers, we'd always take the covers off and look at the ratings of the power transistors. Bitter experience told us that these needed to be rated at at least double their anticipated max load to last. Kit that didn't have that headroom rarely survived a year (in which case we'd just replace the transistors with "correctly" rated ones and the problems disappeared).
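For what it's worth, that derating rule boils down to a one-line check; the has_headroom() helper and the wattage figures below are purely illustrative:

```python
# Minimal sketch of the 2x headroom rule of thumb above. The part
# ratings and loads are illustrative, not real component data.

def has_headroom(rated_watts: float, anticipated_max_watts: float,
                 factor: float = 2.0) -> bool:
    """True if the part is rated at least `factor` times its anticipated max load."""
    return rated_watts >= factor * anticipated_max_watts

print(has_headroom(rated_watts=100, anticipated_max_watts=60))  # False: expect an early death
print(has_headroom(rated_watts=150, anticipated_max_watts=60))  # True: should last
```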
-
Wednesday 8th November 2023 13:58 GMT David Hicklin
Re: Lessons:
> ratings of the power transistors
Similar thing at home with some touch-sensitive bedside lamps: touch the base and they came on dim, touch again and they got brighter, before cycling to off after the fourth touch. Very nice.
We had three "almost" identical lamps, one from John Lewis and two cheaper ones from ASDA. The cheaper ones failed the first time a bulb blew, as their triacs were rated right on the limit, whilst the JL one is still going years later because it has a much beefier (volts and amps rating) triac.
-
-
Tuesday 7th November 2023 17:49 GMT A Non e-mouse
Re: Lessons:
Redundancy sometimes causes the problems.
Totally agree. You should only use HA if you really need it. I've wasted far too much time trying to persuade HA kit that had thrown a wobble that it was in the wrong state (i.e. broken when it thought it was fine, or fine when it thought it was broken).
And just like cryptography: don't try to invent your own HA magic sauce. You will miss so many failure modes. If you can't afford to do HA properly, don't do it at all.
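To illustrate one of the failure modes homegrown HA tends to miss, here's a minimal sketch (not any particular product's logic) of why a two-node heartbeat can't tell "peer is dead" from "I'm the one that's partitioned", and why a third voter plus a strict-majority rule is the usual answer:

```python
# Minimal sketch of why a two-node heartbeat is not enough: a node that
# loses contact with its peer cannot tell "peer is dead" from "I'm the
# one who is partitioned". A third vote (quorum) is the usual answer.
# Illustrative only, not any real HA product's logic.

def may_take_over(votes_visible: int, total_votes: int) -> bool:
    """Only act as primary while a strict majority of votes is reachable."""
    return votes_visible > total_votes // 2

# Two-node cluster under partition: each side sees only its own vote, so a
# strict-majority rule blocks both (no split-brain, but no service either),
# which is why a third voter is needed.
print(may_take_over(votes_visible=1, total_votes=2))   # False
# Three-node cluster: only the side that can still see two voters proceeds.
print(may_take_over(votes_visible=2, total_votes=3))   # True
print(may_take_over(votes_visible=1, total_votes=3))   # False
```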
-
-
-
-
Tuesday 7th November 2023 12:02 GMT TaabuTheCat
You missed a remarkable part of the post-mortem
"the overnight staffing at the site did not include an experienced operations or electrical expert — the overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week."
For what I assume is a Tier 4 DC hosting critical services? Flexential have some 'splaining to do.
-
Tuesday 7th November 2023 12:57 GMT tip pc
Re: You missed a remarkable part of the post-mortem
the overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week.
My money is on the unaccompanied technician doing something he wasn't adequately trained to do.
How long till the Who, Me? or On Call?
-
-
Tuesday 7th November 2023 13:32 GMT John Klos
Cloudflare want us to trust them, but...
They want to recentralize the Internet around them.
They want to host while claiming they don't host, redefining the word "host" so they don't have to handle abuse.
They want to host known spammers and scammers because "free speech".
They want people all over the world to send their DNS queries to them via DoH.
They want to marginalize most of the non-western world by having CAPTCHAs on every web site.
And so on.
They try to distract from their nefarious activities with tons of seemingly positive things, like cheerful participation on Hacker News and free services (which do little more than begin the process of addiction and dependency).
I'm glad they're dumb enough to have outage after outage showing how the Internet is worse for using Cloudflare, because if they worked perfectly, many people would never know.
-
Tuesday 7th November 2023 13:35 GMT Korev
"We had never tested fully taking the entire PDX-04 facility offline. As a result, we had missed the importance of some of these dependencies," wrote Prince, and we appreciate the honesty.
Absolutely.
Obviously the single dependency on PDX-04 wasn't good; but this gives you confidence that it'll probably be fixed in the near future...
A pint for the Cloudflare people who probably need one -->
-
Tuesday 7th November 2023 21:18 GMT Excused Boots
Yes, quite probably this particular dependency will be fixed, and maybe even one or two others that tumble out of the woodwork. Fine.
Except what they then need to do is give it a month or so, then deliberately cause a complete shutdown and see what else doesn't work. Fix that, and rinse and repeat until they can drop a whole data centre and nobody really notices. Then, and only then, can they say they have a properly redundant and (reasonably) disaster-proof system.
Of course, the problem is that until you get to that final state, customers absolutely will be inconvenienced, your company gets bad press in organs such as El Reg, you lose money and take reputational damage, and the C-suite people get cold feet. So the full-up tests don't really happen.*
* I suspect mostly; yes, there probably are some companies prepared to do this and run the risk, because it helps them (and ultimately their customers) in the long term.
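As a rough illustration of that rinse-and-repeat exercise, here's a minimal sketch of a scheduled "drop a whole facility" drill. Everything in it (drain_facility(), check_health(), the facility names) is a hypothetical placeholder, not anything Cloudflare or Flexential has described:

```python
# Hypothetical sketch of a periodic "drop a whole facility" drill in the
# spirit of the comment above. drain_facility(), restore_facility() and
# check_health() are placeholders, not real APIs or anyone's actual process.
import time

FACILITIES = ["dc-a", "dc-b", "dc-c"]  # made-up facility names

def drain_facility(name: str) -> None:
    print(f"[drill] draining traffic and control plane away from {name}")

def restore_facility(name: str) -> None:
    print(f"[drill] restoring {name}")

def check_health() -> list:
    """Return the services that broke while the facility was out (placeholder)."""
    return []  # in reality this would poll monitoring over the soak period

def run_drill(facility: str) -> list:
    drain_facility(facility)
    time.sleep(1)  # stand-in for a real soak period
    broken = check_health()
    restore_facility(facility)
    return broken

if __name__ == "__main__":
    for dc in FACILITIES:
        failures = run_drill(dc)
        # Rinse and repeat: fix whatever fell over, then drill again until
        # dropping any single facility surprises nobody.
        print(f"{dc}: {'all good' if not failures else failures}")
```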
-
-
Tuesday 7th November 2023 13:46 GMT gitignore
RCA
I was taught years ago to do root cause analysis, to find the root cause of a failure.
I don't think, in 20+ years in the IT industry and some in EE before that, that I have ever seen a _single_ root cause of an issue. It's almost always a confluence of several unlikely intersecting issues.
-
Tuesday 7th November 2023 15:52 GMT usbac
Re: RCA
Look up the Swiss Cheese model that is used in aviation safety.
I have read many NTSB air crash reports. Almost every accident occurs due to a number of often small events all lining up just at the wrong time. In most cases, several of these items happening at the same time would not have led to the accident, but just the right combination happened on that day, and people lost their lives.
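As a back-of-the-envelope illustration of the Swiss cheese model, here's a minimal sketch with made-up per-layer failure probabilities (and note that real incidents, including this outage, rarely involve truly independent layers):

```python
# Back-of-the-envelope Swiss cheese illustration: each layer of defence
# has a small, made-up probability of failing, and the accident needs
# every layer to fail at once. Assumes independence, which real
# incidents rarely honour.
from math import prod

layer_failure_probs = {
    "utility feed lost": 0.01,
    "generators fail to hold": 0.05,
    "UPS batteries exhausted": 0.10,
    "no experienced staff on shift": 0.02,
}

p_all_at_once = prod(layer_failure_probs.values())
print(f"Chance of every layer failing together: {p_all_at_once:.2e}")
# Tiny on any given day -- which is exactly why, when it does happen,
# the report reads like a string of individually unremarkable events.
```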
-
Tuesday 21st November 2023 17:04 GMT collinsl
Re: RCA
Perfect example of this:
The plane crashed because of a burned-out bulb in the landing gear indicator lights: the crew became so preoccupied with checking whether the landing gear was actually down and locked that they didn't notice the autopilot had disengaged and their descent rate had increased, and they flew into a swamp in Florida.
https://en.wikipedia.org/wiki/Eastern_Air_Lines_Flight_401
-
-
-
Wednesday 15th November 2023 18:27 GMT PRR
> .. and the breakers were found to be faulty
I was vicariously involved in another datacenter power collapse, and it too revealed faulty breakers.
Are datacenter breakers stressed differently than home or factory breakers?
"We" (me and the power company) are cautious about residential breakers because they start CHEEP and are neglected. But aren't $500-$5,000 breakers made of sterner stuff, and tested with malice?