Cloudflare comes clean on crashing a chunk of the web: How small errors and one tiny bit of code led to a huge mess

Cloudflare has published a detailed and refreshingly honest report into precisely what went wrong earlier this month when its systems fell over and took a big wedge of the internet with it. We already knew from a quick summary published the next day, and our interview with its CTO John Graham-Cumming, that the 30-minute global …

  1. JeevesMkII
    FAIL

    It never gets any less true.

    "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

    1. JassMan
      Trollface

      Re: It never gets any less true.

      Some people when confronted by a problem DON'T think.

      FTFY

    2. Mage Silver badge
      Alert

      Re: It never gets any less true.

      Regex is wonderful and frightening. I use it to do complex edits and find errors proof reading can't find. Obviously Saving first and Ctrl-Z are both good.

      I'm too much of a coward to put regex into production release code, or a web page unless it's very simple and doesn't loop. I test that on a test machine. I've also used it to convert WS3 files to plain text and sometimes that did unexpected things.

    3. Doctor Syntax Silver badge

      Re: It never gets any less true.

      One of the things to remember when writing regular expressions is that just because it's possible to do some particular thing it isn't necessarily a good idea.

    4. Down not across

      Re: It never gets any less true.

      I use perl and regular expressions all the time. Just like most other programming constructs, you need to engage brain and think what you're doing.

      That doesn't mean I haven't made mistakes and had to debug WTF is going on when things have not worked quite as expected. Usually issues are due to missing some rare (or thought to be nonexistent) case in testing.

      Debugging regex issues can be challenging, but that doesn't change the fact that fundamentally they are extremely useful.

      TLDR; You can pry regex out of my dead cold hands

      Oh, and thanks for the honest explanation, Cloudflare: you have certainly gone up a few notches in my book. Very refreshing.

      1. Psmo

        Re: It never gets any less true.

        If all you have is a hammer, every problem looks like a hammer, every problem looks like a hammer, every problem looks like a hammer, every<SIGNAL LOST>

  2. This post has been deleted by its author

  3. elDog

    For all the regex haters out there: Don't forget that machine code is the same

    A regular expression is composed of lots of operations strung together, just like hand-coded assembler or compiler-generated code.

    While we might say that compilers have withstood the trials of time and that the code they generate is almost (99.999996%) correct, this has only been achieved through years of real-world testing.

    True, regexes are hard to read for you mere mortals. But then so are assembler, firmware, C++, Ada, etc. Time to bring back MS Basic!

    1. NullNix

      Re: For all the regex haters out there: Don't forget that machine code is the same

      And they're not fixing this by not using regexes: they're fixing this by switching to a NFA-based regex engine, so the nearly-exponential explosion no longer happens. (Instead, you can get an explosion in NFA states, but this is statically detectable at regex compile time rather than stabbing you in the face at runtime without warning. Much better.)
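To make the difference concrete, here is a minimal sketch, not Cloudflare's engine or any real library: a toy pattern language (my own assumption) where each token is a character plus a kind (`one`, `opt` or `star`), matched two ways. `backtrack_match` tries alternatives one at a time, as a backtracking engine does; `nfa_match` tracks the set of all reachable pattern positions at once. The names `pathological`, `backtrack_match` and `nfa_match` are hypothetical, and the step counters exist only for illustration.

```python
# Toy pattern: a list of (char, kind) tokens; char '.' matches anything.
# kind is 'one' (exactly once), 'opt' (zero or one), 'star' (zero or more).
# Both matchers require the whole text to match (anchored at both ends).

def pathological(n):
    """Build a?{n}a{n}, a classic worst case for backtracking matchers."""
    return [("a", "opt")] * n + [("a", "one")] * n

def backtrack_match(tokens, text, steps):
    """Recursive backtracking; steps[0] counts calls to go()."""
    def go(i, j):
        steps[0] += 1
        if i == len(tokens):
            return j == len(text)
        ch, kind = tokens[i]
        ok = j < len(text) and (ch == "." or ch == text[j])
        if kind == "one":
            return ok and go(i + 1, j + 1)
        if kind == "opt":
            # Greedy: try consuming first, then fall back to skipping.
            return (ok and go(i + 1, j + 1)) or go(i + 1, j)
        # 'star': consume and stay on this token, or move on.
        return (ok and go(i, j + 1)) or go(i + 1, j)
    return go(0, 0)

def closure(tokens, states):
    """Add every position reachable by skipping 'opt'/'star' tokens."""
    out, stack = set(), list(states)
    while stack:
        i = stack.pop()
        if i in out:
            continue
        out.add(i)
        if i < len(tokens) and tokens[i][1] in ("opt", "star"):
            stack.append(i + 1)
    return out

def nfa_match(tokens, text, steps):
    """Set-of-states simulation; steps[0] counts (state, char) visits."""
    states = closure(tokens, {0})
    for c in text:
        nxt = set()
        for i in states:
            steps[0] += 1
            if i < len(tokens):
                ch, kind = tokens[i]
                if ch == "." or ch == c:
                    nxt.add(i if kind == "star" else i + 1)
        states = closure(tokens, nxt)
    return len(tokens) in states

if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        bt, nfa = [0], [0]
        backtrack_match(pathological(n), "a" * n, bt)
        nfa_match(pathological(n), "a" * n, nfa)
        print(n, "backtracking steps:", bt[0], "state-set steps:", nfa[0])
```

In this sketch the backtracking count more than doubles each time `n` grows by one, while the state-set count is bounded by the text length times the number of pattern positions. The price is that the set simulation cannot support features such as backreferences.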

      1. Psmo

        Re: For all the regex haters out there: Don't forget that machine code is the same

        this is statically detectable at regex compile time rather than stabbing you in the face at runtime without warning. Much better.

        I realise everything is relative but if the choice is blowing away a few toes or getting stabbed in the face I'd rather stay in bed.

  4. Nick Stallman

    All relative

    Compared to other cloud outages, this one is very minor. Not only was it detected and acknowledged quickly, it was also resolved extremely quickly, and the postmortem lets you know exactly what went wrong in great detail.

    Outages happen. If only they were all this pleasant to experience.

  5. swm

    When something goes wrong in a complicated system it can sometimes be difficult to pinpoint the problem. On the Dartmouth Time Sharing System we would sometimes get bugs that would show up after months of flawless operation. When we finally tracked down the bug and analyzed what the effects of the bug would be we were surprised that the system worked at all. Checking recent code changes would, of course, not be profitable in such a case.

    Debugging under pressure is hard.

  6. chuBb.

    Fair play to cloudflare for the openness

    Sounds like their S.I.O.F. plan (shit it's on fire) worked. It could do with a polish, but it earns a solid B+ in terms of response. To be fair, these sorts of plans are always best guesses, so finding delays in the 2FA email hitting inboxes isn't a massive problem, for example, although I would hope they invest in an additional factor like keyfobs so that delays in inbox access can't kill the network.

    Of course I'm looking at this as though they were a normal company and not a pervasive part of net infrastructure, so I'm ignoring the damage done to customers, but even so the fact they kept heads under that pressure and stuck to a plan shows good discipline and training (as you would expect with the responsibility they have)

    So yeah my take is that the response is acceptable, they will clearly be patching holes in the process, and until the plan is put to test under real conditions you can't know where the deficiencies in it lie

  7. STOP_FORTH
    Linux

    Own goal

    So, basically they fork-bombed themselves?

    Impressive.

    Linus won't let us do this anymore.

  8. mark l 2 Silver badge

    One major outage in 6 years is pretty good if we compare that against the number of Office365 problems Microsoft has had in the same period, and arguably Microsoft is a much bigger company, is better known and has deeper pockets, so should be more reliable than Cloudflare on paper.

    1. SW10

      Yeah, paper’s great.

      It’s that transition to ASCII characters that always screws me up.

  9. Jason Bloomberg Silver badge
    Pint

    Automated Roll-Back

    The single thing which would have been useful is something more informative than "502 Gateway Error". It took a while to figure out it was "Cloudflare done bad" and not something else.

    I am wondering how well an automated roll-back would work when something is deployed and the system goes down. I expect there is potential for that to go wrong or compound the issues. If such a thing had been in place and had worked, it might have reduced the down-time to around a third of what it was.

    Not that I think the half hour it took was unreasonable. And the near hour for the final fix was not that bad either considering 'we've just screwed up badly, and we don't want another' would have been at the forefront of everyone's minds.

    And they certainly deserve credit for something more than mere platitudes about how their number one priority is serving customers with complete disregard to having failed to do that.

    1. Anonymous Coward

      Re: Automated Roll-Back

      An automated rollback only helps if the bug triggered was introduced by a recent commit. That wouldn't help at all when the bad code was actually introduced long beforehand.
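A minimal sketch of the idea under discussion, assuming a deploy step, a rollback step and a health probe are all available as plain callables. `deploy_with_rollback` and its parameters are hypothetical names for illustration, not any real deployment tool's API.

```python
from typing import Callable

def deploy_with_rollback(
    apply_change: Callable[[], None],
    roll_back: Callable[[], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> bool:
    """Apply a change, probe health a few times, undo it on the first failure.

    Returns True if the change was kept, False if it was rolled back.
    """
    apply_change()
    for _ in range(checks):
        if not is_healthy():
            # Only the change just applied is undone; a fault that predates
            # it would still be present after this call.
            roll_back()
            return False
    return True

if __name__ == "__main__":
    state = {"version": 1}
    kept = deploy_with_rollback(
        apply_change=lambda: state.update(version=2),
        roll_back=lambda: state.update(version=1),
        is_healthy=lambda: state["version"] != 2,  # pretend version 2 is broken
    )
    print("kept:", kept, "running version:", state["version"])
```

The sketch also shows the limitation raised above: if `is_healthy` was already failing because of code shipped long ago, rolling back the latest change restores the previous version but not the service.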

  10. anothercynic Silver badge

    Kudos to CloudFlare

    For at least being honest.

    Usually you get PR fudge that may or may not point the finger at the company itself (with it holding up its hand and saying 'mea culpa, mea maxima culpa').

    This is a good, refreshing post that is clear about 'we fucked up, and we're fixing it this way'.

    1. Anonymous Coward

      Re: Kudos to CloudFlare

      It's weird that more companies don't do this. I've taught a bit of PR at a business college (even though I have only the slightest experience in the field - it was a shitty college, ok? Oh, and I mean college in the UK sense - not like in the US) and I used to run the sessions by looking at a bunch of case studies, good and bad, and getting the students to come up with lists of best practice for dealing with similar situations. The 'we fucked up' one invariably ended up looking like:

      1) Admit something went wrong

      2) Admit it was your fault, and say sorry

      3) Explain what happened (potentially going so far as to give names, but don't throw people under the bus)

      4) Explain what you did or are doing to make this particular mess better (possibly give compensation, but if you do the rest of this well enough, most of the time this isn't necessary)

      5) Explain what you're doing to try to avoid it happening again

      Obviously, you need to actually do the things you say in 4 and 5, but unless you really have no concern about losing business, you'll do that as a matter of course. CloudFlare did all of this (no idea whether they compensated anyone) and they're getting exactly the kind of forgiving response from their paying customers I'd expect. I really can't understand why any companies still go down the route of issuing vague apologies and payouts - most customers don't actually want compensation, or the meaningless apology: they want the bad thing not to have happened, and if you acknowledge that the bad thing did happen, but explain logically how you'll avoid it happening again, people are usually satisfied with that.

  11. Doctor Syntax Silver badge

    A good rule of thumb is that authorisation to roll out a change includes authorisation to roll it back in an emergency. It shouldn't need someone else to be consulted.

    A second is that if things go pear-shaped promptly on rolling out a change it should be rolled back PDQ. Even if the problem was actually something else you're no worse off than you were before and at least you now know it wasn't the change.

    However this is the way to handle the PR side - not the self-serving, transparently untrue boilerplate response we usually get. It actually raises Cloudflare's reputation.

  12. jamesb2147

    Where's Kieren?

    A Kieren story that doesn't utterly dump on all the parties involved?

    I believe this to be the first evidence that the robots have begun quietly replacing us.

    1. Psmo

      Re: Where's Kieren?

      A Kieren story that doesn't utterly dump on all the parties involved?

      Bleep bloop PutDownNotFoundError

  13. Michael Wojcik Silver badge

    That damned phrase

    We're tempted to use the phrase-du-jour "perfect storm,"

    Resist the temptation, regardless of circumstances.

    First, that phrase is, at best, du-2000, when the (overrated) eponymous film was released. It's been nearly two decades. Enough.

    Second, it's a grating, idiotic expression that contributes nothing of value to the sentence in which it appears, even if it were novel.

    Third, it has been direly overused, particularly by the sort of people who would do us all a favor by just shutting the hell up.

    1. Psmo

      Re: That damned phrase

      Third, it has been direly overused, particularly by the sort of people who would do us all a favor by just shutting the hell up.

      Fourth, it lends itself to a ranty debunking by random commenters that obfuscates any real content

      Fifth, some wag will take it upon himself to address said ranty debunking therefore perpetuating the cycle.

      ...

      <maximum recursion depth exceeded, please reboot universe>

  14. fredesmite
    Mushroom

    Remember - Cloud computing

    Is nothing more than putting your crap on someone else's computer that other people are using, and expecting the owners to care more about it than you do.
