The months and days before and after CrowdStrike's fatal Friday

The great irony of the CrowdStrike fiasco is that a cybersecurity company caused the exact sort of massive global outage it was supposed to prevent. And it all started with an effort to make life more difficult for criminals and their malware, with an update to its endpoint detection and response tool Falcon. Earlier today, …

  1. Andrew Hodgkinson

    Just bad luck?!

WTAF? NEVER doing global updates, always doing rolling ones, is basic practice; not even best practice. And their idea of force-pushing an update, ignoring the security policies of the organisations that trusted them, was cavalier and arrogant.

    Windows is to blame for being so fragile that a driver error can crash the entire kernel. CrowdStrike are to blame for buggy tests, a buggy validator, a buggy file reader, a dreadful forced-update policy, and a dreadful all-or-nothing global update system.

    And... Absolutely nothing substantial will change as a result of this.

    1. gv

      Re: Just bad luck?!

      "To err is human, but to really foul things up you need a computer."

    2. Dan 55 Silver badge

      Re: Just bad luck?!

As long as the anti-virus industrial complex claims that the danger they protect against is so great that they must react so fast there's no time to properly test, and as long as CSOs and CTOs believe that their computers are protected by the snake oil sold by these companies, nothing will change.

      1. goblinski Bronze badge

        Re: Just bad luck?!

"As long as the anti-virus industrial complex claims that the danger they protect against is so great that they must react so fast there's no time to properly test, and as long as CSOs and CTOs believe that their computers are protected by the snake oil sold by these companies, nothing will change."

        I'll respectfully call utter bullshit on that :-P

Specifically - the dangers they protect against are NOT so great, and they must NOT react so fast that there's no time to properly test, because... why, exactly?

    3. UnknownUnknown

      Re: Just bad luck?!

Did they not test it on any real-world kit - some VMs, PCs, self-checkouts, digital signage, EPoS registers, embedded devices … - or did they just throw it out the door through automated testing systems alone??

They have global partners like - cough - the Mercedes-AMG F1 Team, for example.

That common sense seems to have been omitted from their first PIR excuses.

      QA environments and labs cost money … for a reason.

      Agile …. Innit !?!

      1. Doctor Syntax Silver badge

        Re: Just bad luck?!

Consider this: the application that actually reads the file found nothing wrong with it and applied it. They now say they did apply some QA testing, which also found nothing wrong with it. What are the chances that the pre-release QA simply re-used whatever code the application itself uses to read the file?

That's all very well provided the code in question is completely effective, but come the day that something nasty, however unintentional the nastiness, slips past it, there's nothing standing between generating the content and borking the end-points.

        1. sitta_europea Silver badge

          Re: Just bad luck?!

          The confusion here, that Quality Assurance and pre-release testing are even remotely the same thing, is why disasters like this happen.

          QA is about making sure that the product can't be supplied in a broken state by designing the processes which produce the said product.

          *Any* product.

          You can't inspect quality into a product after you've produced it.

          You have to design quality into the production process from the very beginning.

          It seems clear that nobody with any clout at Crowdstrike has any idea what Quality means.

          It wasn't just bad luck. It was inevitable.

        2. Michael Wojcik Silver badge

          Re: Just bad luck?!

          Regardless of what their dedicated test systems did, it's clear that they don't have a basic smoke test where they first push the update to a normal Windows system running the production Crowdstrike Falcon software and make sure nothing goes boom. Saying "our special test program didn't catch this" is rather beside the point; they don't test their own software in-house before pushing it to customers.

          Yes, they should have rolling updates. Yes, they should have much, much better QA. But the fact that they don't try their stuff, in the way customers run it, internally before publishing it, is a huge problem. That's completely inexcusable.

          1. Someone Else Silver badge

            Re: Just bad luck?!

            Something about eating your own dog food....

    4. Primus Secundus Tertius

      Re: Just bad luck?!

      The driver failed to cope with bad data. But there is a limit to how much checking you can put into an embedded driver. What failed was the so-called checking of the data, an offline process where the program has plenty of time to do a thorough job.

      Many software faults I saw in my career were due to bad handling of error cases.

      1. Andrew Mayo

        Re: Just bad luck?!

        That's not the case. There is an API specifically to allow a kernel memory access to be checked prior to actually touching the memory, to be sure it's a valid address. If the driver code had done this, then the faulty file would not have caused any issues at all. This is just poor quality engineering. You build resilience into systems at multiple levels, you don't rely on QA processes alone.
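Very roughly, and with an invented content-file layout (the real Falcon format isn't public), the kind of check being described looks like this in a Windows kernel driver - bounds-check the index first, and only then consult MmIsAddressValid, which Microsoft documents as a hint rather than a guarantee:

```c
#include <ntddk.h>

/* Hypothetical layout of one entry in a content/channel file.
   Purely illustrative - not the real Falcon format. */
typedef struct _CONTENT_ENTRY {
    ULONG  FieldCount;
    PVOID *Fields;          /* array of FieldCount pointers */
} CONTENT_ENTRY, *PCONTENT_ENTRY;

/* Return the Nth field, or NULL rather than chasing a junk pointer. */
PVOID SafeGetField(PCONTENT_ENTRY Entry, ULONG Index)
{
    if (Entry == NULL || Entry->Fields == NULL)
        return NULL;

    if (Index >= Entry->FieldCount)   /* the bounds check does the real work */
        return NULL;

    /* MmIsAddressValid() reports whether touching the address would fault
       right now; it can't promise the page stays valid, so it's belt-and-braces
       on top of the bounds check, not a substitute for it. */
    if (!MmIsAddressValid(&Entry->Fields[Index]))
        return NULL;

    return Entry->Fields[Index];
}
```

The point isn't this exact API - it's that the driver returns a polite failure instead of letting a malformed file take the kernel down.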

  2. Anonymous Coward

    CrowdStrike's remedy is to continue to test in production, but with staggered deployments

Gartner analyst Jon Amato is an idiot. No wonder analysts still recommend buying CrowdStrike stock. It wasn't bad luck. CrowdStrike pushed code/code-as-configuration directly to production, relying on unit tests alone to catch bugs. Normal companies that invest in risk management would have unit testing, and would also deploy the code to test Windows environments and hire QA to test it, before pushing code to production Windows environments.

    Now CrowdStrike will fix their unit tests and have staggered deployments to customers' production environments, because they are too cheap to hire QA to do end-to-end testing. It's still testing in production.
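For the record, staggering deployments isn't hard either. One common approach (sketched below with made-up names, nothing vendor-specific) is to hash each host into a stable bucket and only serve the new content once the rollout percentage covers that bucket, widening from 1% to 10% to 100% only while the earlier rings stay healthy:

```c
#include <stdbool.h>
#include <stdint.h>

/* Map a host ID to a stable bucket 0..99 (64-bit FNV-1a hash), so the same
   host always lands in the same slice of every rollout. */
static unsigned rollout_bucket(const char *host_id)
{
    uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a offset basis */
    for (const char *p = host_id; *p != '\0'; ++p) {
        h ^= (uint8_t)*p;
        h *= 0x100000001b3ULL;              /* FNV-1a prime */
    }
    return (unsigned)(h % 100);
}

/* Serve the new channel file only to hosts inside the current ring,
   e.g. rollout_percent = 1, then 10, then 50, then 100. */
bool should_get_update(const char *host_id, unsigned rollout_percent)
{
    return rollout_bucket(host_id) < rollout_percent;
}
```

Even the crudest version of this would have limited the blast radius to whatever the first ring covered.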

  3. Pascal Monett Silver badge

So, Kurtz was at McAfee when they bugged?

    And he is now at CrowdStrike and they fuck up.

    This is starting to look like Kurtz should exit the IT security field and go into something less critical, like Pokemon.

    1. phils

Re: So, Kurtz was at McAfee when they bugged?

      If he brings Pokemon down he might face some real consequences.

    2. Doctor Syntax Silver badge

Re: So, Kurtz was at McAfee when they bugged?

McAfee was an accident, so this is just a coincidence. It will need a third time before we start to get suspicious.

      1. Someone Else Silver badge

Re: So, Kurtz was at McAfee when they bugged?

        Malcolm Nance said it best: "Coincidence takes lots of planning."

        Although in this case, it may prove to be that coincidence takes lack of planning...

    3. StewartWhite Bronze badge
      Joke

Re: So, Kurtz was at McAfee when they bugged?

      "This is starting to look like Kurtz should exit the IT security field and go into something less critical, like Pokemon."

      Shame CrowdStrike didn't apply the Pokemon ethos of "Gotta catch 'em all" to the bugs in Falcon.

  4. Anonymous Coward

    Slightly apropos

    https://xkcd.com/1118/

  5. MrBanana

    Sometimes it can be recovered, but not this time

The response a company makes to a major problem can be very telling. If they respond quickly, do everything they can to fix the problem, and reassure the screwee that things have changed, it is possible to actually build on their reputation. I've been in Tech Support with the customer literally screaming down the phone, the support team and a VP on the line, threatening all sorts of legal action and public shaming. Meanwhile someone is booking my flight and a taxi to the airport; I'll buy clothes etc. when I get there. Get it right, and you end up with a customer even happier than before the problem. But...

I can't see that happening here. The screw-up was just too big. The recovery couldn't be accelerated by CrowdStrike doing anything other than publishing the recovery process, which to their credit they did quickly. They will be a pariah from now on. Even if they survive the publicity fallout, many of those customers will still be suing them, and I doubt that CrowdStrike has lawyers capable of hiding behind their EULA against such an onslaught. I would also guess that all the other AV pushers will be contacting the, now very public, customers of CrowdStrike to promote their alternative at a very affordable price. The association of the CrowdStrike CEO with a similar screw-up at McAfee also doesn't help; he will have to go.

    1. Doctor Syntax Silver badge

      Re: Sometimes it can be recovered, but not this time

What about those members of the general public who were screwees? They have accepted no EULA. They can claim to have been harmed by it. They can argue that any reasonable person would say the release process was so insufficient as to be negligent. The only defence Crowdstrike can offer is that they are too remote from the public, as the EULA should have prevented their customers from deploying it on anything that mattered in the real world. Arguing that your product is unfit for the purposes for which your customers might wish to buy it isn't a good look.

    2. DJV Silver badge

      Re: Sometimes it can be recovered, but not this time

      Upvoted with enthusiasm for the use of the term "screwee"!

    3. Michael Wojcik Silver badge

      Re: Sometimes it can be recovered, but not this time

"The screw-up was just too big."

      Stock market says "nah" (the stock is still up year-on-year), and I agree. Technology is sticky. CIOs are mad today, but when they look at how much work it will be to replace Crowdstrike with some other vendor, they'll decide to believe the grovelling and promises to do better.

The simple fact is that in IT, companies very, very rarely get punished for screwing up. They get destroyed all the time for other reasons, of course — failing to predict the next wave of hype, scaring a bigger competitor, not being sufficiently sexy to attract the attention of the non-techies who hold the purse-strings. But causing disasters? That's treated as just another facet of the immaturity of the industry and quickly forgotten.

      And as I've noted before, Crowdstrike's USP of being particularly auditor-friendly for things like ISO 27001 is a significant moat.

      1. Richard 12 Silver badge

        Re: Sometimes it can be recovered, but not this time

The stock market is probably wrong. Or rather, it's doing its usual short-termism, as the consequences won't hit for a while.

        There is going to be a huge financial cost.

The market has started to price that into the insurers, but once the insurers start trying to recover their losses from CrowdStrike, there definitely isn't enough money in CrowdStrike to pay even 1% of the insured losses caused by their admitted negligence.

        Their insurance may cover some of it, but they will then become an uninsurable business, which then means other businesses cannot use them and remain insured...

        Then there's the possible criminal liability in some jurisdictions.

        It's likely that they'll be filing for bankruptcy protection, and their only asset is their name - which is now synonymous with the biggest IT disaster that's ever occurred. (So far)

  6. Headley_Grange Silver badge

    "I would also guess that all the other AV pushers will be contacting the, now very public, customers of CrowdStrike to promote their alternative at a very affordable price"

    True, but none of that changes the fact that businesses put critical infrastructure in place without having a fallback plan to cover the risks of outsourcing the OS and how updates to key files were done. They, too, did no testing and didn't seem to understand that testing might not even be possible with Crowdstrike's update policies. They had no fallback, no redundancy, no parallel systems - unless you count whiteboards at airports. They had put in place remote IT support which was useless in many cases because on-site support was required to restart PCs.

    Crowdstrike fucked up, but it was the sort of fuck-up that happens every day across our industry, the main difference being that for most of us the impact is getting a bollocking from the boss or a client and having to push out a quick fix with the upside being a potential "Who, Me?" article in the Reg.

    Getting a new AV supplier won't help until businesses take "critical" at face value and make their systems resilient.

    1. pip25

      Exactly what would you have done in their shoes? Buy Macs or Linux boxes as "backups"? Use multiple AV vendors? And how on Earth was a company like Ryanair supposed to know that their AV vendor was following insane deployment practices...?

      It's easy to complain about missing disaster recovery plans after a disaster, but a company will never be prepared for everything - that's part of why they're paying that AV vendor in the first place!

      1. Michael Wojcik Silver badge

        Being able to immediately spin up backup vApp images of your Windows servers would be a start. Making it possible for technical end users to get their Bitlocker keys themselves, rather than having to ask IT. Not letting Crowdstrike push updates to every single Windows machine in the organization more or less simultaneously.

        1. druck Silver badge

If they immediately spun up the backup images, the first thing that would have happened is that the bad update file would have been downloaded and crashed them too. They would have had to wait until Crowdstrike had pushed the fixed update, and then spun up.

          As for the bitlocker keys, most users would write them on a post-it note and stick it to the bottom of the laptop, which is great for security.

But I will agree with you over the staged updates.

      2. Headley_Grange Silver badge

        They only had to do two things - FMEA and Risk Management.

I'd expect them to understand their single-point failure modes and the commercial and technical performance of the products at the pinch points. I'd expect them to do a risk assessment, cost up the potential risk mitigations, and then do a cost/benefit analysis. If they decided that the mitigations were more expensive than having their systems down for a few days, then doing nothing would have been OK - but it would have been their informed decision, and they can't whinge at Crowdstrike, MS, or whoever when the event happens exactly as they could have predicted.
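To be concrete about the cost/benefit step (numbers pulled out of thin air purely for illustration): compare the expected annual loss against the annual cost of the mitigation, and write the decision down.

```c
#include <stdio.h>

/* Toy annualised loss expectancy vs. mitigation cost - every figure invented. */
int main(void)
{
    double outage_probability_per_year = 0.05;  /* a 1-in-20 chance each year  */
    double outage_cost                 = 20e6;  /* lost revenue plus recovery  */
    double mitigation_cost_per_year    = 750e3; /* staged updates, spare kit   */

    double expected_loss = outage_probability_per_year * outage_cost;  /* 1.0m */

    printf("Expected annual loss: %.0f\n", expected_loss);
    printf("Mitigation cost/year: %.0f\n", mitigation_cost_per_year);
    puts(expected_loss > mitigation_cost_per_year
             ? "Mitigate."
             : "Accept the risk - as an informed, documented decision.");
    return 0;
}
```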

        I bet that less than 10% of the businesses affected had a serious attempt at this. I bet there are IT bods out there who are digging out emails from years ago where they warned about risks like this and were fobbed off with "what do you expect us to do about it - have two OSs?".

    2. Ken Hagan Gold badge

      (Edit: This was supposed to be a reply to pip25. Soz for the confusion.)

      I'd start by insisting on a product that lets *me* choose the roll-out schedule.

      Then I'd like a test environment that I could roll out to first, but if my own bean counters say no to that then I would at least split my production environment in two so that I only borked half the company.

      And if I then find that my vendor has inserted a back door for updates that they consider too important for little old me to defer, then I will consider my legal options. (It's flat-out criminal in some places.)

      1. Someone Else Silver badge

"I'd start by insisting on a product that lets *me* choose the roll-out schedule."

        Oh, you mean like allowing me to manage the rollout of OS software updates? Like Micros~1?? Yeah, like that's gonna happen.

    3. hoofie2002

      Nailed It

      Hammer on Nail interface. We did a security exercise a few days before where we asked some business units what they would do if we got hit and they lost some of their compute resources. Cue blank looks and fidgeting. Not encouraging.

  7. Grunchy Silver badge

…I always assumed “Crowdstrike” was some kind of regrettable catastrophe involving George Russell and a Mercedes.

    1. Doctor Syntax Silver badge

      I'd seen it there but never bothered to Google it. Now I don't have to. Such is the power of advertising.

  8. Mike 125

    On the upside, Asterix gets a lot of mentions in that YouTube clip.

    That must be Gauling for Mr Kurtz.

  9. steven_t
    WTF?

    "Operates the way CrowdStrike does"

    Amato said. "This could have happened to literally any organization that operates the way CrowdStrike does."

    No organization, let alone one offering endpoint protection for business customers, should operate the way CrowdStrike does.

    It releases configuration changes to all of its customers without testing whether the changes achieve their aims. This seems to be its policy - CrowdStrike's explanation of the incident doesn't say anywhere that someone should have tested whether the new channel file/template instance achieved its aims.

Some organisations have a policy of testing changes before rolling them out but, in an emergency, they sometimes reduce or even skip the testing step. That's not what CrowdStrike says happened. Their normal process seems to be: come up with a new thing they want to monitor, change a configuration file in a way that looks like it should detect that new thing, run it through the validator, and unleash it on their customers, without actually testing whether it detects that new thing... or raises false positives... or breaks existing functionality... or degrades performance... or crashes the machine.
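Even the "does it achieve its aim" part can be automated. A sketch of the sort of pre-release checks I mean - the entry point, file name and pipe names are all invented, since the real interfaces aren't public:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical test hook: with the candidate content file loaded, would the
   engine flag the creation of this named pipe? Purely illustrative. */
bool engine_flags_pipe(const char *content_file, const char *pipe_name);

int main(void)
{
    const char *candidate = "channel-candidate.bin";   /* new template instance */

    /* 1. Does it detect the behaviour it was written to detect? */
    assert(engine_flags_pipe(candidate, "\\\\.\\pipe\\suspicious-c2-pipe"));

    /* 2. Does it leave a known-benign pipe alone (false positives)? */
    assert(!engine_flags_pipe(candidate, "\\\\.\\pipe\\mojo.1234.5678"));

    /* 3. Getting this far means loading the file didn't crash the host -
       which is why this runs on a disposable VM, not the build machine. */
    return 0;
}
```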

    I appreciate it isn't easy to test whether the change correctly identifies malicious named pipe behaviour because you either need to use a malware sample or a malware simulator program, but it isn't rocket science. I'm sure CrowdStrike could afford to employ someone who knows how to do it. It could also afford to employ a manager who knows how important it is.

    1. Doctor Syntax Silver badge

      Re: "Operates the way CrowdStrike does"

A more basic test is to ensure that it doesn't bork the machine on which it was deployed. Was it only some subset of customers' machines that would have been affected?

      1. steven_t

        Re: "Operates the way CrowdStrike does"

        From what I've read, it affected all Windows hosts (both physical and virtual machines) that had CrowdStrike installed, so a very simple test process would have spotted that. A simple test wouldn't have spotted the dozens of other ways a configuration change can go wrong.

        1. Michael Wojcik Silver badge

          Re: "Operates the way CrowdStrike does"

          It affected all Windows machines running Falcon in the window between the '037 release and the '038 release. Some systems didn't download until after '038 was released (per the article, 78 minutes after '037), and so when they rebooted they read the working '038 rather than the broken '037.

          So, yes, if the idiots at Crowdstrike had bothered to install each update on a single Windows machine and reboot it before publishing, they'd have seen the problem. If only there were some way to do that with some sort of "virtual" computer and automate it in some kind of "continuous" fashion!

    2. egrep
      Alert

      Re: "Operates the way CrowdStrike does"

      This could have happened to literally any organization that operates the way CrowdStrike does, testing in production.

But testing in production first is never considered best practice. Amato and the article's author accept CrowdStrike's PR statement at face value, not realizing that the way they operate is more like a move-fast-and-break-things startup than an enterprise. An enterprise operating like a startup isn't the flex that people think it is. It means there is a lot of tech debt.

      1. CapeCarl

        Move-slow-and-eat-things

        Didn't Kings of yestercenturies have "canary food testers"..."Trust not what the kitchen sends you" (or perhaps in US 1980s terms "Trust, but verify".)

        Would seem that this "provider / client" defense pattern was laid down a long time ago.

  10. Howard Sway Silver badge

    it sent $10 Uber Eats gift codes to its over-worked partners and colleagues

    Sorry we fucked up your billion dollar global business. Have a free burger.

The root cause of the disaster was a classic lack of input validation: assuming the input file was well formed, not bothering to check that it contained a value, and thence causing an invalid memory access through an undefined pointer. This is such a basic error that it's generally one of the first things you learn to guard against as a programmer. It raises serious questions about the quality and robustness of the rest of their code, which I would not trust from now on until it had undergone a serious external audit, were I to be using it.
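For anyone who hasn't written a file parser lately, the guard being skipped is about this hard - an invented file layout, but the shape of the check is universal: never hand back a pointer the buffer doesn't actually contain.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical update-file layout: a header followed by `count` fixed-size
   records. Nothing to do with the real channel-file format. */
#define UPDATE_MAGIC 0x31323943u

typedef struct { uint32_t magic; uint32_t count; } update_header;
typedef struct { uint32_t kind;  uint32_t payload[4]; } update_record;

/* Return record `index`, or NULL if the buffer is too short, the header is
   malformed, or the index is out of range. */
const update_record *get_record(const uint8_t *buf, size_t len, uint32_t index)
{
    update_header hdr;

    if (buf == NULL || len < sizeof hdr)
        return NULL;

    memcpy(&hdr, buf, sizeof hdr);          /* no misaligned reads either */

    if (hdr.magic != UPDATE_MAGIC)
        return NULL;

    /* Does the buffer really hold `count` records? Overflow-safe check. */
    if (hdr.count > (len - sizeof hdr) / sizeof(update_record))
        return NULL;

    if (index >= hdr.count)
        return NULL;

    return (const update_record *)(buf + sizeof hdr) + index;
}
```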

    1. Dan 55 Silver badge

      Re: it sent $10 Uber Eats gift codes to its over-worked partners and colleagues

      ... then some people found that their codes were cancelled on trying to redeem them.

      If only Crowdstrike were as quick at cancelling bad updates.

    2. Michael Wojcik Silver badge

      Re: it sent $10 Uber Eats gift codes to its over-worked partners and colleagues

I think it's a mistake to label any one problem the "root" or "real" issue here. Yes, crap input validation is definitely a problem with Falcon, and everyone at Crowdstrike should be forced to read The 24 Deadly Sins of Software Security and pass a test on it. (Is 24 the current version? I didn't look it up.)

      But not testing their own software in its real environment is also a big problem. And not having comprehensive unit tests (and proving that with code coverage analysis) is a big problem. And not doing fuzz testing is a big problem. And not doing rolling updates is a big problem.
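Fuzzing the content parser, in particular, is almost free these days. A minimal libFuzzer harness - the parser entry point here is hypothetical, standing in for whatever function actually consumes a channel file:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parser under test: must survive arbitrary bytes. In a real
   build this would be the production content-file parser, linked in. */
int parse_content_file(const uint8_t *data, size_t size);

/* libFuzzer calls this millions of times with mutated inputs; built with
   -fsanitize=fuzzer,address, any bad memory access becomes a reported crash
   with a reproducer file instead of a fleet of blue screens. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    (void)parse_content_file(data, size);   /* only "don't crash" matters here */
    return 0;
}
```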

      Others have argued that Falcon has too much running in kernel mode, and that's arguably a big problem too, though it's not clear just what the performance hit of having better kernel/user-mode separation would be for Falcon, or whether they could be using ETW rather than a driver, etc.

      There are many problems here. It's a cascade of ineptitude. Let's not focus on one issue to the exclusion of all the others.

      And perhaps the biggest is that they'll quite likely get away with it.

    3. Richard 12 Silver badge

      Re: it sent $10 Uber Eats gift codes to its over-worked partners and colleagues

      The root cause is that they do not have any QA whatsoever.

      This wasn't even "works on my machine" level of untested.

      It literally hadn't been tried on a single running kernel before they forced it onto every single one of their customers.

      And that's their normal procedure.

  11. BenMyers

    An oxymoron

Given the McAfee XP disaster in 2010 and the recent CrowdStrike debacle, saying the name of ex-McAfee CTO and CrowdStrike founder George Kurtz and "computer security" in the same breath is an all-time great oxymoron.

  12. koborn

    Monoculture

    Setting aside who actually caused the problem and how, this is yet another example of the insane persistence of a monoculture that relies on Microsoft products for everything. Even if they are the best/cheapest/simplest etc (and all of those are *very* debatable) it is simply incompetence of the highest order to use a single product set everywhere.

    The industry uses the term "anti-virus" and similar terms freely. It would be well worth looking at medicine, agriculture, and general biology to try and get even the most basic idea of why variety is essential. Do not rely on one crop variety, one type of drug, and so on. Everyone knows this outside our industry - and some inside.

    Working in the ISP world I noted that MS products were used sparingly and for less-critical functions. A rich ecosystem of other OS and application products may make it harder to maintain day to day, but for sure it's a lot less vulnerable.
