Cloudflare gave everyone a 30-minute break from a chunk of the internet yesterday: Here's how they did it

Internet services outfit Cloudflare took careful aim and unloaded both barrels at its feet yesterday, taking out a large chunk of the internet as it did so. In an impressive act of openness, the company posted a distressingly detailed post-mortem on the cockwomblery that led to the outage. The Register also spoke to a weary …

  1. Jamie Jones Silver badge
    Happy

    the updates are pushed out to a small group of customers "who tend to be a little bit cheeky with us" and "do naughty things" before it is progressively rolled out to the wider world.

    Nice!

    Still, if you are a cloudflare customer who likes to be on the bleeding edge, simply do "naughty things"!

    1. Jim Mitchell
      Devil

      I would think that the customer set in question is also Cloudflare's most sticky, as they have already been bounced elsewhere.

  2. Steve K Silver badge

    This is the other meaning of "Serverless Computing"

    This is the other meaning of "Serverless Computing" - fortunately temporarily...

    1. pavel.petrman Silver badge

      Re: This is the other meaning of "Serverless Computing"

      Or is the internetless computing becoming a thing? As a service, of course.

      1. John Robson Silver badge

        Re: This is the other meaning of "Serverless Computing"

        Internot as a service...

  3. Loyal Commenter Silver badge

    This is an important lesson in the testability of regular expressions

    Unless you can step through the parsing engine and test every possible input, it is very hard to spot issues with non-trivial regexes.

    This is a problem that has been known about for a long time, and there is a nice discussion of it in Larry Wall's "Programming Perl" where using lookaheads in an expression in a certain way can cause it to never complete execution (at least not in the expected lifetime of the universe).

    I'm not saying not to use them, just be very, very careful when you do, and make sure you've double-checked you understand what the expression is actually asking for!
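
A minimal sketch of the kind of blow-up described above, using Python's backtracking `re` engine. The pattern is a hypothetical textbook case with a nested quantifier, not the regex from the outage (which the thread has not seen); it only shows how each extra input character can roughly double the time taken to conclude there is no match.

```python
import re
import time

# Hypothetical pattern, not the one from the outage: the nested quantifier
# lets the engine split a run of "a"s between the inner and outer "+" in
# exponentially many ways before it can conclude there is no match.
PATTERN = re.compile(r"(a+)+$")


def time_failed_match(n):
    """Time one failing search on n 'a' characters followed by a 'b'."""
    text = "a" * n + "b"
    start = time.perf_counter()
    result = PATTERN.search(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


timings = {}
for n in (10, 14, 18, 20):
    result, elapsed = time_failed_match(n)
    timings[n] = elapsed
    print(f"n={n:2d}  matched={result is not None}  seconds={elapsed:.4f}")
```

The inputs are kept short on purpose. Adding a few dozen more characters to the same input would keep this engine busy for far longer than anyone would wait.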

    1. phuzz Silver badge
      Facepalm

      Re: This is an important lesson in the testability of regular expressions

      If it caused 100% CPU on every machine, then it can't have been a particularly obscure or rare string. Presumably their testing didn't include a representative set of data to test on, otherwise it should have shown up straight away.

      It's one thing making a mistake that only cocks up occasionally under rare circumstances, it's another to immediately peg every CPU to max...

      1. Loyal Commenter Silver badge

        Re: This is an important lesson in the testability of regular expressions

        Without knowing the nature of the regex in question, it could have been anywhere between common and obvious to rare and subtle, and without being a master of reading regexes (like SQL execution plans, it's more an art than a science), it may still have been easily missed on visual inspection.

        It could, for instance, have been a pattern that causes an infinite loop under certain circumstances, which aren't found in the test environment (because of a missing test case) but are common in the wild, and it's possible that only a single instance of the offending pattern was required in the wild to trigger the regex in that way. Again, without seeing the regex, it's impossible to say one way or the other.

        1. e^iπ+1=0

          Without knowing the nature of the regex in question,

          Exactly - divulge the guilty regex, or it didn't happen (#deepfake).

          1. Loyal Commenter Silver badge

            Re: Without knowing the nature of the regex in question,

            Of course, it could be that the regex in question is longer than this entire comments thread, and it'll take you a couple of weeks to work out what it's for, let alone what it's actually doing.

            Case in point - google "email address regular expression" and look at some of the examples.

            1. DCFusor

              Re: Without knowing the nature of the regex in question,

              Some of them are utterly "out there" to be sure - every book I've got on Perl, or on things like "mastering Regexes", points out that this particular job is not possible to implement correctly with a regex, and that if you really want to validate email addresses, you should use some programming in whatever language you're using and only use smaller regexes for the parts they serve best.

              I think 100% of the references warn of this, or at least, all the ones I've seen. It's like "this is an example of what not to do" in the ones I have read.

        2. Doctor Syntax Silver badge

          Re: This is an important lesson in the testability of regular expressions

          "reading regexes (like SQL execution plans, it's more an art than a science)"

          I wonder if it's possible to have a regex parser compile in test mode to give an execution plan. But I suspect the result would be a lot harder to understand than an SQL execution plan.
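
Something in this direction exists in Python's standard library: compiling with the `re.DEBUG` flag prints the parsed structure of a pattern. It is a static dump of how the engine read the expression, not a per-input execution trace, so it is closer to a parse tree than to a true execution plan. The sketch below captures that dump for a hypothetical pattern; the helper name `describe` is made up for illustration.

```python
import contextlib
import io
import re


def describe(pattern):
    """Return the debug dump Python's re module prints while compiling."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        re.compile(pattern, re.DEBUG)
    return buffer.getvalue()


# Hypothetical pattern with a nested repeat, chosen only for illustration.
dump = describe(r"(a+)+$")
print(dump)
```

A repeat nested inside another repeat shows up in the dump as one MAX_REPEAT entry indented under another, which is at least a visible hint of where backtracking trouble can start. As suspected above, the output gets hard to read quickly for longer patterns.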

      2. maffski

        Re: This is an important lesson in the testability of regular expressions

        My suspicion would be that it was data volume rather than any particular data set that was the issue.

      3. theblackhand

        Re: This is an important lesson in the testability of regular expressions

        My guess is the pipeline is something along the lines of:

        - write a rule

        - validate rule and add to rulebase

        - check rulebase in monitor mode against pre-canned test traffic

        - check rulebase in enforce mode against pre-canned test traffic

        - check rulebase in monitor mode against sample traffic containing items to block

        - check rulebase in enforce mode against sample traffic containing items to block

        - check rulebase in monitor mode against production traffic for select customers

        - check rulebase in enforce mode against production test traffic for select customers

        - deploy to production

        This is based largely on (historical?) Google checks for firewall rule changes. As long as the hit counts/device health stats don't show anything scary, everything should be good.

        I wonder if the canned/sample traffic didn't trigger CPU usage in quite the same way (i.e. in small doses the checks remain in CPU caches but as traffic rises above X it starts to cause high latency with memory reads and the CPU is left waiting) or resulted in additional CPU to fully process the rule with some production traffic (repeated calls to a script possibly?) that wasn't fully considered.
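
The guessed pipeline above can be sketched as a loop that only advances while a health probe stays under a threshold. Everything here is hypothetical: the stage names are a shortened version of the list above, and `CPU_LIMIT`, `roll_out` and `fake_probe` are invented for illustration, not taken from Cloudflare or Google.

```python
# Hypothetical stage names, loosely following the guessed pipeline above.
STAGES = [
    "canned traffic, monitor mode",
    "canned traffic, enforce mode",
    "select customers, monitor mode",
    "select customers, enforce mode",
    "global production",
]

CPU_LIMIT = 0.80  # assumed health threshold: halt above 80% CPU


def roll_out(rule, measure_cpu):
    """Advance a rule stage by stage; stop at the first unhealthy stage.

    measure_cpu(rule, stage) is a caller-supplied probe returning CPU
    utilisation between 0.0 and 1.0 once the rule is live in that stage.
    Returns the list of stages the rule passed.
    """
    passed = []
    for stage in STAGES:
        cpu = measure_cpu(rule, stage)
        if cpu > CPU_LIMIT:
            print(f"halt at '{stage}': cpu={cpu:.0%}, rolling back")
            break
        passed.append(stage)
    return passed


# Toy probe: the rule looks cheap on canned traffic but pegs the CPU
# as soon as it sees real customer traffic.
def fake_probe(rule, stage):
    return 0.99 if "customers" in stage or "global" in stage else 0.20


print(roll_out("new-waf-rule", fake_probe))
```

With the toy probe, the rule passes both canned-traffic stages and is halted at the first stage that carries real traffic, which is the whole point of having that stage before the global one.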

        1. Claptrap314 Silver badge

          Re: This is an important lesson in the testability of regular expressions

          The interview suggests that they were skipping a couple of steps.

          1. theblackhand

            Re: This is an important lesson in the testability of regular expressions

            I missed the bit about jumping from test to globally deployed and missing the "select few test customers"

    2. Lee D

      Re: This is an important lesson in the testability of regular expressions

      Seems to me like what they need is a limit or timeout on how long regexes can take - alerting and terminating if they go over.

      You don't need fancy tech. What you need is not to swamp 100% CPU on huge multicore devices when you could have just said "Has this regex taken more than 10ms to execute? Then kill it and tell the admin so we don't fall over globally".
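
A rough sketch of the "kill it if it runs too long" idea, under stated assumptions. Python's standard `re` module has no timeout of its own, so this version runs each scan in a child Python process and lets `subprocess.run` kill it when the budget expires. Process start-up alone costs far more than the 10ms mentioned above, so this only illustrates the control flow; a real engine would enforce the budget inside the matcher. The function name `scan_with_budget` and the exit codes are invented for illustration.

```python
import subprocess
import sys

# Child program: exit code 10 means "match", 11 means "no match".
# Any other exit code (for example an invalid pattern) counts as an error.
_CHILD = (
    "import re, sys\n"
    "pattern, text = sys.argv[1], sys.argv[2]\n"
    "sys.exit(10 if re.search(pattern, text) else 11)\n"
)


def scan_with_budget(pattern, text, budget_seconds=1.0):
    """Return 'match', 'no-match', 'timeout' or 'error' for one scan."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            timeout=budget_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return "timeout"
    if done.returncode == 10:
        return "match"
    if done.returncode == 11:
        return "no-match"
    return "error"


if __name__ == "__main__":
    print(scan_with_budget(r"ab+c", "xxabbbc", budget_seconds=30.0))
    # Hypothetical pathological pattern and input; this one is cut off.
    print(scan_with_budget(r"(a+)+$", "a" * 60 + "b", budget_seconds=1.0))
```

What the caller does with a "timeout" result (alert, block, or let through) is a separate policy decision that this sketch deliberately leaves open.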

      1. Loyal Commenter Silver badge

        Re: This is an important lesson in the testability of regular expressions

        The problem here, is that if the purpose of that regex is to trap malicious code, then malicious code authors would exploit this to slip their payload past the regex, by deliberately putting something that would cause it to time out in the script. Ironically, in attempting to increase security, by closing off part of the attack surface, this would decrease it, by opening another (potentially bigger) hole.

        1. Lee D

          Re: This is an important lesson in the testability of regular expressions

          Whereas at the moment, the attackers can only DoS their entire infrastructure with bad source data on a poorly-written regex. So much better!

          You write the regex so that it's written properly. So that it doesn't matter what data it's given, it can resolve it within a set time. If it can't do that, then you can't use it anyway as it will introduce *so much* latency into the system that it turns into a DoS and becomes useless.

          You're confusing "source data" (hacker controlled) with "regex expression" (Cloudflare controlled). If the regex can't deal with the source data in time, it should alert. There's a clue in that word... alert.

          If it alerts on every damn page you go on that has a bit of Javascript, it's useless anyway but at least it didn't bring half the globe down with it.

          And then realise that maybe, just maybe, regex hunting is no better or different to AV signatures - which also exhibit this same problem.

          If a malicious attacker can control the data in the page to the point that they can make your regexes timeout, then they can do a lot worse anyway. Hell, "give up" and return an error in that instance. You'll still have *much less* impact than taking down your entire CDN because of a multitude of over-running regexes from a handful of sites. You'll just have a handful of sites that don't work, rather than an entire international company service.

          1. Loyal Commenter Silver badge

            Re: This is an important lesson in the testability of regular expressions

            The point I was trying to make here is that if you have a bit of code that "scans" something for dodgy scripts using a regex, and an attacker knows that a timeout will "pass" the script, then you are opening yourself up to an attack: the attacker crafts a bit of script that will time out the search and, in doing so, stop the whole scanning process, allowing other "nasties" to get through. This presupposes that the same bit of code that times out is also responsible for the rest of the scanning, but if we're talking about doing stuff with regexes, it's a possibility that this is how it is working. Alternatively, a timeout is set on the total time to check something, in which case the regex times out and prevents anything else after it from running, allowing things to slip through.

            The better solution is not to use a regex at all, or use one that is so trivial that it is not possible to feed it input it will choke on. If you must use something like this, you should never allow something to pass through after timing out. It should be an instant fail. The flip side of that is, that if you get it wrong, and make it inadvertently CPU-expensive under certain circumstances (as here), then even if you do have a timeout, you either have a trivial DoS channel, by flooding with things that will time-out, or you have a timeout so short that legitimate things get flagged up, and you get flooded with false positives.
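
The difference between passing and failing on a timeout can be sketched in a few lines. This is a toy model, not anyone's real scanner: `scan`, `slow_rule` and `nasty_rule` are invented names, and the timeout is simulated by raising `TimeoutError` rather than by measuring time.

```python
def scan(payload, rules, fail_closed=True):
    """Run each rule in order; return 'block' or 'allow' for the payload."""
    for rule in rules:
        try:
            if rule(payload):
                return "block"
        except TimeoutError:
            # The scan is abandoned here, so later rules never run.
            return "block" if fail_closed else "allow"
    return "allow"


def slow_rule(payload):
    # Stand-in for a regex that blows its time budget on crafted input.
    if "trigger-slow-path" in payload:
        raise TimeoutError("rule exceeded its budget")
    return False


def nasty_rule(payload):
    # Stand-in for the rule that would have caught the real payload.
    return "evil()" in payload


rules = [slow_rule, nasty_rule]
crafted = "trigger-slow-path; evil()"

print(scan(crafted, rules, fail_closed=False))  # allow
print(scan(crafted, rules, fail_closed=True))   # block
```

Failing closed stops the bypass, but it also means that anyone who can trigger the slow path can get arbitrary requests blocked, which is the DoS trade-off described above.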

  4. Blockchain commentard
    Facepalm

    I like

    including DNS over HTTPS (DoH)

    Where's the Homer icon?

    1. Anonymous Coward
      Anonymous Coward

      Re: I like

      Ah that would be why my PiHole suddenl

  5. Anonymous Coward
    Anonymous Coward

    The dangers of not thinking clearly

    are much greater now than ever before. It's not that there's something new in our way of thinking - it's that credulous and confused thinking can be much more lethal in ways it was never before. Carl Sagan

  6. Anonymous Coward
    Anonymous Coward

    To put this short, nobody did it

    It was an automated script that didn't do what it was supposed to do. Bad, script! Bad, script, go to your storage (room) and stay there!

    It's good to be king!

  7. Luke McCarthy

    Back to comp-sci school

    That's what you get for using a regular expression engine that uses nondeterministic finite automata (with backtracking). For applications at this kind of scale, only a DFA will do.
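
To make the DFA point concrete, here is a small hand-built example, a sketch only. It recognises "one or more 'a' characters and nothing else", which is the same language as the backtracking-prone pattern `^(a+)+$`. The state names and helper functions are made up for illustration; a real DFA-based engine would build its tables from the pattern automatically.

```python
# States of a hand-built DFA for "one or more 'a' characters and nothing
# else", the same language as the backtracking-prone pattern ^(a+)+$.
START, IN_RUN, DEAD = 0, 1, 2


def step(state, char):
    """Return the next state; DEAD is a trap state with no way out."""
    if state == DEAD:
        return DEAD
    return IN_RUN if char == "a" else DEAD


def dfa_match(text):
    """Return (matched, steps); steps never exceeds len(text)."""
    state, steps = START, 0
    for char in text:
        state = step(state, char)
        steps += 1
        if state == DEAD:
            break
    return state == IN_RUN, steps


print(dfa_match("aaaa"))          # (True, 4)
print(dfa_match("a" * 50 + "b"))  # (False, 51)
```

Each input character is examined once and there is nothing to backtrack into, so the worst case is linear in the length of the input.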

  8. Ol'Peculier

    Obligatory quote...

    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

    1. Christoph

      Re: Obligatory quote...

      They should have written it in APL.

    2. Loyal Commenter Silver badge

      Re: Obligatory quote...

      obligatory xkcd

    3. Psmo

      Re: Obligatory quote...

      Some people only have one joke.

      So they introduce regular expressions.

      Now they have <SIGNAL LOST>

    4. Claptrap314 Silver badge
      Devil

      Re: Obligatory quote...

      I've never had a problem of that sort. Perhaps you should study state machines more carefully.

  9. Psmo
    Holmes

    To err is human...

    ...to really fuck things up you need a computer.

    ...to propagate the error globally in pseudo-real-time to fuck everything up simultaneously is DevOps.

    1. yoganmahew

      Re: To err is human...

      And CI/CD, which seems to be yet another pup sold as a way to get rid of people. "I know, let's automate everything, then we can get rid of all those checks and balances that keep our systems up! Those checks and balances cost money you know!".

      Who needs 5 nines anyway, 4.5 nines is great, 3 nines is bigly, sure our parents were delighted with 2 nines...

      One trick company tools and processes are not appropriate for complex large businesses.

  10. Ken Moorhouse Silver badge

    WAF

    I think you will find that WAF is an abbreviation used exclusively to signify the state of the UK after Brexit.

    1. John Robson Silver badge

      Re: WAF

      Isn't the A normally a T?

      1. Ken Moorhouse Silver badge

        Re: WAF

        Twas I believe Evil Auditor who coined the phrase in the context of Brexit.

        https://forums.theregister.co.uk/forum/all/2019/02/28/brexit_border_it_systems/#c_3728276

  11. Claptrap314 Silver badge

    Detailed postmortem?

    Honest, yes. But there are a lot of questions about the rollout scripts that matter.

    But it sounds like this was an edge case that was missed. Might have only happened one query in 1000000. Still would have taken 100% of their servers out in seconds.

    If the regex was actually generated, then you get a whole 'nother set of difficulties in catching these.

  12. mutt13y

    Honest RFO

    At least they were honest and frank about what caused the outage. It is quite refreshing instead of the usual corporate BS.

  13. fredesmite

    Remember - Cloud computing

    Is nothing more than letting some lowest bidder provider let you run your crap on their machines that other people are using at the same time

    1. Orv

      Re: Remember - Cloud computing

      When it comes to content delivery networks -- what Cloudflare does -- there aren't really other good options. Unless you plan to scatter your own servers all over the globe, and make sure you buy enough of them for your absolute peak use case (which means most of the time a lot of them will be idle.)

  14. Joseba4242

    Re: I'm worried they'll outlaw Kodi in some unenforceable way...

    While Cloudflare was refreshingly open about the technical cause for the outage, they have not touched on the process issues.

    There's always a possibility that a change, however thoroughly lab tested, has unintended consequences in a large-scale production environment.

    That's what good old phased rollouts are for. Why did they make an immediate, global change? That approach should have raised a very big, crimson red flag.

    1. IGotOut Silver badge

      Re: I'm worried they'll outlaw Kodi in some unenforceable way...

      That's what good old phased rollouts are for. Why did they make an immediate, global change?

      Go back, read the WHOLE article

  15. Doctor Syntax Silver badge

    I can appreciate that for a business such as Cloudflare "we may be under attack" is the normal first-out-of-the-box reaction to a sudden problem. However when a change has just been rolled out the possibility that the change may have been responsible should take precedence. Rolling back the change, assuming that can be done quickly, should be a fairly obvious - and prompt - response. At the very least it triages the most likely cause over which you have control and in the worst case leaves you no worse off than before you rolled out the change.

    1. DCFusor

      At least one of the early reports I read (which could have been wrong) said they DID try to roll it back right away - but at 100% cpu usage, the attempt didn't work out so well, being behind in the priorities compared to already running things.


Biting the hand that feeds IT © 1998–2022