CrowdStrike meets Murphy's Law: Anything that can go wrong will

CrowdStrike's recent Windows debacle will surely earn a prominent place in the annals of epic tech failures. On July 19, the cybersecurity giant accomplished what legions of hackers could only dream of – bringing millions of Windows systems worldwide to their knees with a single botched update. As a veteran tech journalist, I' …

  1. Anonymous Coward
    Anonymous Coward

    I spent the afternoon watching the ticket system fill with alerts saying that CrowdStrike had quarantined the Dell update utility on every computer at one of our big clients.

    The migration of the IT estate from us to the team who mandated that crap was halted because of huge issues caused by CrowdStrike... before last week's outage...

  2. Michael M

    There but for the grace of God go all of us

    1. Martin Gregorie

      People, including managers, do get careless

      About the first lesson I learnt when I became a developer was not to be careless, which covers everything from TESTING your code adequately to having your designs (at all levels, from subroutines to complete systems) criticised by your peers and all levels of users before you even think of implementation. And you keep all the documentation written before and during system development and modification. Fortunately I had all that stuff drummed into me while I was still writing assembler and COBOL modules designed by others in an ICL service bureau. This approach has always served me well at all levels of system maintenance and development.

      I'm also aware that there are still many developers and code shops where this doesn't happen: special props to Smiths Industries, where the systems analysts binned documentation once anything had gone live! But I never dreamt that this level of sloppiness was still commonplace in this century, or that anybody, anywhere, was as sloppy and incompetent as CrowdStrike has turned out to be - except, of course, for Microsloth.


      1. Michael Strorm Silver badge

        Re: People, including managers, do get careless

        > I never dreamt that this level of sloppiness was still commonplace in this century

        On the contrary, such "sloppiness" is exactly the direction we've been moving *towards* in the 21st century - the epitome of the modern attitude towards testing in the "move fast and break things" and "why pay QA testers when we can get end-users to be the guinea pigs for free?" age.

        1. Sudosu Silver badge

          Re: People, including managers, do get careless

          So the CrowdStrike deployment was a complete success (autocorrect wants to change it to Crowd Trike, how fitting):

          - Move Fast, check

          - Break Things, check

      2. rg287 Silver badge

        Re: People, including managers, do get careless

        but I never dreamt that this level of sloppiness was still commonplace in this century,

        I was at a social meetup jobbie last month where IIRC we were discussing the evils of copilot/AI-generated code and the risks associated with it. The point was of course made that it wouldn't matter if you were checking your commits properly - dodgy code would be refused.

        A young, recently graduated dev chipped in "I mean, most companies do code review before..." at which point half the room snorted beer out their noses.^1

        Alas no. At least not unless it's a regulatory environment (and even then, Boeing are showing how compliance isn't a panacea for good dev culture).

        ^1 Probably worth mentioning that it was a web-dev crowd, not systems developers. And most of them do evangelise testing, but the nature of web being what it is, management don't always facilitate the time or resources.

        1. Sudosu Silver badge

          Re: People, including managers, do get careless

          I think society as a whole has lost the concept of cause and effect...

          For almost two decades I worked on a team that transformed a large organization into using the ITIL framework (framework because you adapt it to your organizational requirements).

          We achieved a decent (it is never perfect) balance between flexibility of change and protecting the organization from the "cowboy" mentality as we used to call it.

          A few years ago someone up high enough was talked into implementing Dev Ops, which in this case meant the developers could do what they wanted without participating in the change management processes because it was slowing them down.

          What followed included numerous outages and impacts due to the many dev teams stepping on each other's and the administrative teams' toes, including several security breaches.

          The proffered excuse was usually "well, it worked fine on my machine".

          I realized my culture at that organization had been replaced by the cowboys who preceded my tenure, so I left for another organization where I could help them reduce their outages and they understood cause and effect.

          I do feel for the end users at my old gig, but I really enjoy helping my new team turn from spending all of their time being reactive to becoming more proactive, while hopefully lowering their stress levels.

          1. 0laf Silver badge

            Re: People, including managers, do get careless

            No I think now it's a loss of any sense of responsibility, or a denial of any responsibility.

            CEOs blame everyone

            Managers blame board and devs

            Devs blame managers and customers

            Customers blame suppliers and admins

            Admins blame suppliers and users

            Users blame everyone

            Blame costs nothing. Responsibility is expensive.

          2. rg287 Silver badge

            Re: People, including managers, do get careless

            A few years ago someone up high enough was talked into implementing Dev Ops, which in this case meant the developers could do what they wanted without participating in the change management processes because it was slowing them down.

            There's a balance to be struck as well. One of the regulars at that meetup did a talk a while back about their CI/CD pipeline. They work on a "little and often" basis with developers encouraged to commit after every little thing - not wandering off and doing a massive merge request touching hundreds of files after a week of work. If they're mopping up little jobs, they can do multiple commits before lunch.

            Their commits get pushed into an amazingly comprehensive pipeline of automated unit tests and then end-to-end tests. If it fails at any point it gets kicked back without making it to production (and the big dashboard on the office wall goes red until the developer fixes it!). Because they commit little and often they know exactly what the issue is and aren't wracking their brains trying to work out what it was they were doing last Tuesday.

            For day-to-day dev work, they don't have a formal change committee or anything - the testing pipeline they have set up is sufficiently robust, and they also do a lot of pair programming which in their experience gives them more robust code to start with. They only need to do more formal change control on big architectural changes.

            They're not formally deploying ITIL, but I suppose in many respects it amounts to the same - they've built a framework in which changes are scrutinised and developers are accountable for their code.

            They of course are the exception in having made that investment in testing and their CI/CD pipeline. Most "DevOps" implementations are - as you say - an excuse for developers to test in Prod.
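
            The shape of such a gate, as a minimal sketch - the test stages here are stubs standing in for their real suites, not anyone's actual pipeline:

            ```python
            def unit_tests(commit: str) -> bool:
                return True  # stand-in for the real unit suite

            def end_to_end_tests(commit: str) -> bool:
                return True  # stand-in for the real end-to-end suite

            def push(commit: str) -> str:
                for stage in (unit_tests, end_to_end_tests):
                    if not stage(commit):
                        # Kicked back before production; the wall dashboard goes red.
                        return f"{commit}: rejected at {stage.__name__} (dashboard red)"
                return f"{commit}: promoted to production"

            # "Little and often": every small commit runs the whole gauntlet.
            for c in ("fix-null-check", "bump-dependency", "tidy-css"):
                print(push(c))
            ```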

        2. The Oncoming Scorn Silver badge
          Pint

          Re: People, including managers, do get careless

          "at which point half the room snorted beer out their noses"

          That's uncomfortable; snorting beer out your nose with chunks of your strawberry cheesecake dessert is downright painful.

      3. Ozan

        Re: People, including managers, do get careless

        In my long working life, in all of my safety trainings, there was one thing always stated: people cause more incidents in routine, everyday work than anything else. You get careless when you do the same thing every day.

    2. heyrick Silver badge

      There but for the grace of God go all of us.

      Except, if God actually dropped it into a sacrificial lamb and saw it BSOD there before pushing it out to millions of machines, this would be a non-story.

      This whole debacle just screams of a lack of basic test-prior-to-release.

      1. ecofeco Silver badge

        Re: There but for the grace of God go all of us.

        Oh it's even worse: failure to check version release.

  3. ecarlseen

    The collapse of what little engineering culture existed in IT

    It's always been a stretch to apply the word "engineering" to IT (many of us who tried very hard seem to be found in these forums), but over the last decade or so we've seen the continuous release of new levels of farce.

    Virtually no IT vendors care about stability anymore - they're too busy propping up the house of cards that is IT corporate stock P/E ratios. Upgrade constantly to provide additional "value" to customers (completely ignoring the costs and risks of every change, because that's the customers' problem, amirite?), because if you don't upgrade constantly you can't justify the recurring revenue required to support the IPO / acquisition / trading price.

    Change is not inherently bad, but change is cost and change is risk and if you're adding cost and risk you need to pair it with a corresponding amount of value. This has always been sketchy, and so customers stopped automatically upgrading because the value wasn't there. So now upgrades are forced on us. This will not end well.

    1. TReko Silver badge

      MBA culture replaced engineering culture

      True, there's no longer an engineering culture; it's all short-term cost saving by people who don't understand the technology.

      Engineers are now a costly "resource", like paperclips. MBAs regard them as fungible. Why pay for expensive engineering resources when someone with the same title can be sourced in India for $9 an hour? Boeing did it with the 737 MAX MCAS software; CrowdStrike did it in February.

      All this ignores that engineering skills do matter and that testing is crucial in a complex environment. The MBA needs to learn that even if they choose to ignore reality, reality will not ignore them.

      1. Will Godfrey Silver badge
        Thumb Up

        Re: MBA culture replaced engineering culture

        "The MBA needs to learn that even if they choose to ignore reality, reality will not ignore them."

        This line should be tattooed onto their foreheads.

        1. Michael Strorm Silver badge

          Re: MBA culture replaced engineering culture

          They *can* ignore it so long as they make sure they're not the ones that have to suffer the consequences.

          Heads may roll at a higher level than usual with Crowdstrike due to the egregiousness of their fuck up- or rather, due to its effect on their share price- but that's very much a rare "worst case" scenario.

          And even then I wouldn't be surprised to see those whose heads roll being carefully-chosen scapegoat(s) with those actually at the top being blamed minimally if at all and quietly moving on elsewhere with minimal effect on their careers.

          1. ecofeco Silver badge

            Re: MBA culture replaced engineering culture

            LOL! Nobody in upper management is going to suffer any consequences. Surely you can guess who the usual scapegoats will be?

        2. ITMA Silver badge
          Devil

          Re: MBA culture replaced engineering culture

          I'd proffer that they need Feynman's classic line tattooed across their foreheads:

          "For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled"

          1. Dan 55 Silver badge

            Re: MBA culture replaced engineering culture

            That line is unfortunately disproven by all of us here spending our days wrangling with crap on x86 boxes. There were plenty of better timelines that could have been taken, but we're in this one.

        3. Anonymous Coward
          Anonymous Coward

          Re: MBA culture replaced engineering culture

          With a 1000 MW laser.

        4. rafff

          Re: MBA culture replaced engineering culture

          "This line should be tattooed onto their foreheads."

          Nah. They can't see it there; tattoo it on their pricks.

          1. The Oncoming Scorn Silver badge
            Joke

            Re: MBA culture replaced engineering culture

            Along with the words....

            "This Way Up"

          2. A.P. Veening Silver badge

            Re: MBA culture replaced engineering culture

            Nah. They can't see it there; tattoo it on their pricks.

            Equally invisible due to the small size*) and the roof covering it.

            *) They need that MBA title to compensate.

      2. Tim99 Silver badge

        Re: MBA culture replaced engineering culture

        Global capital may be deciding that sourcing from India is now too expensive, and that South/East Asia or some African countries can come in at about 2/3rds of that...

        1. ecofeco Silver badge

          Re: MBA culture replaced engineering culture

          Already happening:

          https://www.indiatimes.com/news/india/indian-techie-narrates-how-they-are-being-replaced-with-vietnam-developers-638881.html

      3. ecarlseen

        Re: MBA culture replaced engineering culture

        Yes. I refer to this behavior as "MBAing the company to death."

      4. 0laf Silver badge
        Pint

        Re: MBA culture replaced engineering culture

        "MBA needs to learn that even if they choose to ignore reality, reality will not ignore them"

        That phrase is a keeper, beers for you

    2. yoganmahew

      Re: The collapse of what little engineering culture existed in IT

      It's made worse by security companies flagging open source packages as "untrustworthy" if they haven't had a new version in a while. So everything has to be updated all the time.

      Anyway, I've been saving up Security Now, so off to listen to that! :D

  4. Howard Sway Silver badge

    Users were left scrambling for answers while critical infrastructure faltered

    I have seen this happening for years on a smaller scale within companies, mostly at the level of server applications, despite warning of the stupidity of the concept of continuous deployment. In almost every case, development and support were kept apart as separate worlds, meaning that when the shit hit the fan, poor lines of communication slowed down the response way more than should have been the case.

    I have come to the conclusion that whoever invented / pushed continuous deployment has never worked in IT support for a day of their lives, otherwise they would have understood the risks of that idea on a deep and fundamental level.

    1. Crypto Monad

      Re: Users were left scrambling for answers while critical infrastructure faltered

      Proper Continuous Deployment depends on:

      (1) A comprehensive automated test suite;

      (2) A pipeline which never deploys anything unless the entire test suite passes;

      (3) Phased deployment (i.e. canaries)

      (4) Instrumentation so you can see no unexpected changes in the behaviour of the canaries

      If you're not doing these, you're not doing CD, you're doing crash-and-burn.
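
      Concretely, the gate might look something like this - a minimal sketch in which the deploy and telemetry helpers are stubs, not any real vendor's API:

      ```python
      import time

      CANARIES = ["canary-01", "canary-02"]
      FLEET = [f"prod-{i:03d}" for i in range(500)]

      def run_test_suite(release: str) -> bool:
          return True  # stub for (1)/(2): the full unit + end-to-end suite

      def deploy(hosts: list[str], release: str) -> None:
          print(f"deploying {release} to {len(hosts)} hosts")

      def telemetry_healthy(host: str) -> bool:
          return True  # stub for (4): crash-rate / error-rate checks

      def rollback(hosts: list[str]) -> None:
          print(f"rolling back {len(hosts)} hosts")

      def ship(release: str) -> bool:
          if not run_test_suite(release):      # (2): nothing ships on a red suite
              return False
          deploy(CANARIES, release)            # (3): canaries first
          time.sleep(1)                        # soak period, shortened for the sketch
          if not all(map(telemetry_healthy, CANARIES)):
              rollback(CANARIES)               # canary regression: kicked back
              return False
          deploy(FLEET, release)               # only now does the fleet get it
          return True

      if __name__ == "__main__":
          ship("channel-file-292")
      ```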

      1. Doctor Syntax Silver badge

        Re: Users were left scrambling for answers while critical infrastructure faltered

        And what about liaison with support to find out whether Right Now is a good time operationally for this particular deployment? It's they who will have to cope with any changes of behaviour, and if there are no changes, why are you deploying it?

        1. BeansB

          Re: Users were left scrambling for answers while critical infrastructure faltered

          I was a senior support engineer at a company where I had final authority on new releases shipping or not (for the products I covered). If I said no-go it didn't release on that day. At least that was the theory but as we all know, theory and reality are very different things. The one time I did that (because the software was not ready and had actual bugs that I didn't feel were appropriate to ship with) the sh!t and the fan met quite quickly.

          The end result was that I never held another release - not because I hadn't been able to justify my previous hold, but because the storm that arose was such a PITA I didn't want to deal with it again. Just another example of companies giving lip service to having the right policies in place while making actual implementation of them so painful as to make it pointless.

          1. Jamie Jones Silver badge

            Re: Users were left scrambling for answers while critical infrastructure faltered

            ...making you the fall guy when a bugged release gets out!

            Damned if you do...

      2. Julian Poyntz

        Re: Users were left scrambling for answers while critical infrastructure faltered

        Sounds like CrowdStrike believe they had (a)

        1. Billy Twillig
          Joke

          1 to (a) correspondence

          If your post had zero-level kernel access (and was written in C#) would you get a BSOD from using the wrong ordering schema?

          “(a) ≠ (1)”

          Posts should be tested with the same rigor applied to software…uh, oh yeah, you did…

          /jk

    2. Doctor Syntax Silver badge

      Re: Users were left scrambling for answers while critical infrastructure faltered

      "In almost every case, development and support were kept apart as separate worlds"

      I had the good fortune, when I started in IT, to have a team that did both with in-house S/W. It's always seemed to me that where that could be managed* it was the better system. The worst arrangement was where Unix admin & DBA were not only separate teams but we didn't even set eyes on each other.

      * Obviously if it was bought-in S/W the customer's support/operations are going to be separate from the vendor's developers**

      ** I have, however, ended up debugging a vendor's code that was crashing a client's system but that was because the product was source-available

  5. Anonymous Coward
    Anonymous Coward

    Windows: a flawed security model

    An OS that required a third party app to open and test each-and-every file for maliciousness before passing it to Userland. An OS that can't tell the difference between OPEN and RUN. Functionality mixed throughout the OS such that it's impossible to successfully debug or remove it. Designed to make it difficult to clone or test for bugs. Windows security is unfixable.

    1. Ken Hagan Gold badge

      Re: Windows: a flawed security model

      "An OS that required a third party app to open and test each-and-every file for maliciousness before passing it to Userland."

      Ironically, for many users, both home and office, their AV package is the only non-Microsoft software that isn't running entirely in user-space.

      "Designed to make it difficult to clone or test for bugs."

      I'd say "Evolved" and like its natural counterpart, the evolution has no particular purpose.

      1. Snake Silver badge

        Re: Windows: a flawed security model

        Ironically, for many users, both home and office, their AV package is the only non-Microsoft software that isn't running entirely in user-space.

        100% not true. Your statement deserves a simple one-word reply:

        "nVidia"

      2. mark l 2 Silver badge

        Re: Windows: a flawed security model

        If you're a PC gamer you probably have some anti-cheat software that runs at the kernel level, because apparently these game publishers think stopping a few saddos who want to cheat at a game is more important than the security and stability of millions of Windows PCs.

        1. Sudosu Silver badge

          Re: Windows: a flawed security model

          That is why, even though it works, I do not run the EA App (their launcher) in Proton under Linux...if it can't run as expected they will likely ban your account.

          It is annoying, as this is the last holdout for me having a Windows desktop at home.

    2. david 12 Silver badge

      Re: Windows: a flawed security model

      An OS that can't tell the difference between OPEN and RUN.

      You get that far, and then I realize that you've got no idea what you are talking about.

      Yes, the security model is more sophisticated than what you are used to on Linux/BSD. No, that doesn't mean what you assert it means.

    3. jeremya
      FAIL

      Re: Windows: a flawed security model

      Windows does have an adequate security model and doesn't require anti-virus.

      The problem is that using the built-in Discretionary Access Control and Policy systems requires skill and a willingness to accept some pain till you get it right.

      However, one criticism could be that it has Discretionary rather than Mandatory Access Control as provided by SELinux. That was an original design decision that has had the inevitable results we see now.

      It is bleeding obvious that systems *will* be compromised despite firewall and anti-virus protection.

      The CrowdStrike fiasco illustrates how security managers have focused on preventing attacks and have neglected exploit containment and system recovery.

    4. Steve Channell
      Mushroom

      Re: Windows: a flawed security model

      This really has nothing to do with the Windows security model - it's the history: when the internet became a retail experience in the early 1990s, Windows had a TCP/IP stack based on DCE and wasn't secure enough to prevent attack - compounded by MS's decision to add IIS remote debugging on the server and ActiveX on the client. Windows rightly got a bad reputation for poor protection, creating a market for third-party virus protection.

      macOS is better protected than Windows because [1] it shares the micro-kernel design of the NT Executive (using messages rather than stack frames to call kernel functions), [2] almost nobody used it when virus attacks exploded, [3] Apple had early experience of virus attacks when it was especially vulnerable due to floppy disk viruses, and learned the lesson.

      CrowdStrike's problems are only just starting: it would be reasonable to blacklist their csagent.sys kernel driver as malware, destroying their business. That we're not (yet) talking about bankruptcy is a testament to two things: [1] their well-funded legal department, [2] their well-funded marketing department.

      While MBA-led tech companies are poor at engineering, the modules on legal and marketing are well used in stopping the truth from getting out.

  6. Doctor Syntax Silver badge

    "It wasn't even a code problem."

    It wasn't just a code problem but there were certainly two code problems. One is mentioned in the article - the validator. But there was also a problem with the code running in the kernel - it didn't check the duff data and reject it. Maybe it didn't check at all, maybe it had the same code as the validator, but one thing is sure - code running at that level should be very defensively written. Every malware slinger out there now knows, on the evidence of this, that if you can get something that looks like a channel file into your victim's CrowdStrike directory there's a good chance it will be picked up unchallenged.
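
    For illustration, this is the sort of defensive parsing meant here - the file layout (magic, version, entry count) is invented for the sketch, not CrowdStrike's actual channel-file format:

    ```python
    import struct

    MAGIC = b"CHAN"

    def parse_channel_file(blob: bytes) -> list[bytes]:
        # Reject anything malformed instead of charging ahead and crashing.
        if len(blob) < 12 or blob[:4] != MAGIC:
            raise ValueError("bad magic / truncated header - rejecting file")
        version, count = struct.unpack_from("<II", blob, 4)
        if version != 1:
            raise ValueError(f"unsupported version {version} - rejecting file")
        entries, offset = [], 12
        for _ in range(count):
            if offset + 4 > len(blob):
                raise ValueError("entry table overruns file - rejecting")
            (size,) = struct.unpack_from("<I", blob, offset)
            offset += 4
            if size == 0 or offset + size > len(blob):
                raise ValueError("entry size out of bounds - rejecting")
            entries.append(blob[offset:offset + size])
            offset += size
        return entries

    # A channel file of all zero bytes fails the magic check and is refused,
    # rather than being handed to kernel code that falls over.
    try:
        parse_channel_file(bytes(64))
    except ValueError as err:
        print("rejected:", err)
    ```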

    1. ecarlseen

      Not to mention a flag to detect failures during an update and roll back after a crash. That's not even a hundred lines of code.
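
      Something along these lines - a sketch with invented file names, using a boot-attempt counter to stand in for whatever flag mechanism a real agent would use:

      ```python
      import json, os, shutil

      ACTIVE, BACKUP, STATE = "channel.bin", "channel.bin.last-good", "update.state"
      MAX_BOOT_ATTEMPTS = 2

      def apply_update(new_file: str) -> None:
          shutil.copy(ACTIVE, BACKUP)            # keep a last-known-good copy
          with open(STATE, "w") as f:            # arm the flag *before* switching
              json.dump({"attempts": 0}, f)
          shutil.copy(new_file, ACTIVE)

      def on_boot() -> None:
          if not os.path.exists(STATE):
              return                             # no unproven update in play
          with open(STATE) as f:
              state = json.load(f)
          state["attempts"] += 1
          if state["attempts"] > MAX_BOOT_ATTEMPTS:
              shutil.copy(BACKUP, ACTIVE)        # crash loop: restore last-good
              os.remove(STATE)
              print("update never survived a boot - rolled back")
          else:
              with open(STATE, "w") as f:
                  json.dump(state, f)

      def mark_boot_healthy() -> None:
          # Called once the machine has been up and stable for a while;
          # only then is the update considered proven.
          if os.path.exists(STATE):
              os.remove(STATE)
      ```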

      1. newspuppy

        Rolling back automatically... was done in 1995....

        By a company called NetAngels.... with software from The Polished Group....

        Their 'netangel' (a personalisation system to help find people, places and products), used via any browser, had such a system that, if an upgrade failed, would restore the old version that worked, and use telemetry to push the fault and fault state (stack, etc) back to the update server. The whole point was that users would always have a working 'netangel' on their system....

        But that was in the day when programmers would do design, and testing, and documentation, and care about the final product.... Something few are taught today. More are trained in the way big tech companies do it: move fast and break things.....

        :(

    2. Anonymous Coward
      Anonymous Coward

      I wonder if it's possible to say that slowly enough for a "manager" to understand?

      The software crashed because of its own configuration file!

      That is AWESOME.

      I couldn't manage that if I tried.

      1. Michael Wojcik Silver badge

        Really? Because a great many programmers have. That's why fuzz testing has historically been so successful at finding bugs.
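
        The idea in miniature - a toy random fuzzer, not a real coverage-guided tool like AFL or libFuzzer:

        ```python
        import random

        def fragile_parse(blob: bytes) -> int:
            # A deliberately buggy parser: trusts an offset byte it never checks.
            return blob[blob[0]]  # IndexError whenever blob[0] >= len(blob)

        random.seed(0)
        crashes = 0
        for _ in range(10_000):
            data = bytes(random.randrange(256) for _ in range(random.randrange(1, 64)))
            try:
                fragile_parse(data)
            except IndexError:
                crashes += 1
        print(f"{crashes} crashing inputs out of 10,000")
        ```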

    3. gnasher729 Silver badge

      We don’t even know if there was anything wrong with the configuration file. We know the code reading it crashed. It is absolutely possible that it crashed on a correct configuration file. Of course it shouldn’t have crashed with any input whatsoever.

      1. katrinab Silver badge
        Alert

        Wasn't the configuration file all 0s or all nulls or something like that?

  7. Pascal Monett Silver badge

    "a sobering wake-up call"

    There is no longer any such thing.

    I do not think that CrowdStrike will fold. Too big to fail comes to mind. Too many megabuck multinationals do not want to overhaul their flawed systems. They will prefer that Kurtz makes his mea culpa, makes a ton of promises, shows transparency and a ton of bullshit, and they will stick with what they know.

    If CEOs were capable of venturing into the unknown, every business PC would be running Linux.

    1. Anonymous Coward
      Anonymous Coward

      Re: "a sobering wake-up call"

      Microsoft and Crowdstrike and any other big vendor will throw money at politicians and bureaucrats who will impede any investigation:

      https://www.propublica.org/article/cyber-safety-board-never-investigated-solarwinds-breach-microsoft

      1. Anonymous Coward
        Anonymous Coward

        Re: "a sobering wake-up call"

        > Microsoft and Crowdstrike and any other big vendor will throw money at politicians and bureaucrats who will impede any investigation:

        The President Ordered a Board to Probe a Massive Russian Cyberattack. It Never Did.

        After Russian intelligence launched one of the most devastating cyber espionage attacks in history against U.S. government agencies

        CyberBS ..

    2. This post has been deleted by its author

    3. ecofeco Silver badge

      Re: "a sobering wake-up call"

      This is EXACTLY what will happen.

      Some kabuki theater and then business as usual, e.g. a flaming train wreck at high speed continuing off the cliff.

  8. short a sandwich

    As the saying goes

    To err is human; to really mess things up requires a computer!

  9. Maximus Decimus Meridius

    Canary Deployment

    I have heard a lot over the last week about phased rollout and canary deployment, but no-one has answered the following:

    Who chooses the canary? Do the users opt in, or is it random chance that a particular system is chosen?

    If it is random, will it be as obvious what the cause of a BSOD is?

    Again if random, how would a company feel about being a guinea-pig? Yes, I understand everyone's a guinea-pig at the moment but you get my drift.

    If opt-in, who would be mad enough to do so? Yes, secondary or test systems, but in reality, it's a case of "let someone else take the risk"

    If there is a new threat detected and in the wild, who takes responsibility for a system infection while the updated software is gradually being rolled out?

    Lots of other issues with this I am sure.

    Not defending Crowdstrike, just want to know if the solution is truly better with no downside.

    1. yoganmahew

      Re: Canary Deployment

      Good questions, but the first canary should be within CrowdStrike's walls as part of their integration process. It seems every system that got the update failed, so they don't even have to do anything fancy with boatloads of Windows versions and configurations. A linter (which appears to be all they had) is a poxy attempt at testing, and most developers wouldn't consider linting to be testing. It's just shy of "if it compiles, it works".

      1. Anonymous Coward
        Anonymous Coward

        Re: Canary Deployment

        Microsoft - and I'm sure others do too - boast about "eating their own dog food". They roll out all updates internally first before going near customers.

      2. gnasher729 Silver badge

        Re: Canary Deployment

        The update was running on a bog-standard desktop PC. So all that needed doing was to give the developer a standard desktop PC, install the upgrade, and see what happens. It would have crashed, of course.

      3. A.P. Veening Silver badge

        Re: Canary Deployment

        but the first canary should be within Crowdstrike's walls as part of their integration process

        Correct.

        And the second canary should be a small subset of PCs in large companies. This is already the case for "normal" updates, but CrowdStrike bypassed that line of defense and pushed it to all PCs (and servers) simultaneously.

    2. Ken Hagan Gold badge

      Re: Canary Deployment

      You nailed it yourself when you mentioned test systems.

      Users are given control over the update policy. Sensible users set up canary systems that are as close as possible to what they are running in their own enterprise.

      It's what business users already do for Windows updates and if your AV vendor doesn't support that kind of operation then it's code running on your computer without your permission.

      Or "a virus" in common parlance.

      1. Doctor Syntax Silver badge

        Re: Canary Deployment

        "Users are given control over the update policy."

        Except that when the S/W fetches its own updates automatically, which I understand to have been the case here, that's a bit trickier. Some subterfuge might be necessary. Say your firewall is set to block the update server most of the time. Then it's opened for a time to let the canary update. If the canary remains perched then you can open it for the production servers for a short period. There's still the possibility of a race condition when an update is released between the closing of the canary window and the closing of the production window.
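
        Roughly this, as a sketch - the allow/block calls are stand-ins for real firewall rule changes:

        ```python
        import time

        def allow(group: str) -> None:
            print(f"firewall: {group} may reach the update server")

        def block(group: str) -> None:
            print(f"firewall: {group} blocked from the update server")

        def canary_still_perched() -> bool:
            return True  # stand-in for health checks on the canary boxes

        def update_window() -> None:
            allow("canaries")
            time.sleep(1)             # canary window (shortened for the sketch)
            block("canaries")
            if canary_still_perched():
                # NB: an update released between the two windows would reach
                # production untested - the race condition noted above.
                allow("production")
                time.sleep(1)         # production window
                block("production")

        update_window()
        ```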

        1. Anonymous Coward
          Anonymous Coward

          Re: Canary Deployment

          From what I know of Crowdstrike, you would also be blocking any malware reporting by blocking the updates.

          However, Crowdstrike appear to be making the right noises about canary tests/deployments and admin control of regular channel updates. Whether that's enough, everyone will decide for themselves.

      2. Anonymous Coward
        Anonymous Coward

        Re: Canary Deployment

        > Users are given control over the update policy. Sensible users set up canary systems that are as close as possible to what that user is running in their own enterprise

        Which is what we did with the outgoing AV at our place which is being replaced by CrowdStrike......<bang!>

      3. Dr Dan Holdsworth
        FAIL

        Re: Canary Deployment

        The problem with giving users the ability to refuse updates is that a minority will refuse ALL changes to a system on the not-completely-insane basis that if the machine is running OK now, why alter it?

        Users generally need to be told that updates will happen, and not ever asked about giving their permission, or this sort of stupidity will occur.

    3. Julian Poyntz

      Re: Canary Deployment

      When I did this, we had our own VMs of critical systems where we would deploy and do basic testing (did the OS die, did any apps die, can the OS reboot, etc).

      We then deployed to pilot and then test before production

      Local IT teams for the different opcos put the relevant machines (server and user) into those AD groups for us to test on - they knew the machines, we did not.

      Worked for us in those days

    4. Anonymous Coward
      Anonymous Coward

      Re: Canary Deployment

      There are at least three kinds of controls to release software safely:

      1. Unit testing (partial testing)

      2. End-to-end testing (full testing in a non-production environment)

      3. Canary/staggered deployments to production

      Many companies do all three. Previously, CrowdStrike only did #1, but there was a bug in their unit test that didn't catch their application bug. After the disaster, CrowdStrike vowed to also do #3 moving forward. Notably, they did not vow to do #2. For software engineers who work on critical systems and have been doing all three, this looks negligent and cheap.

      Canary/staggered deployments are traditionally random, but usually it's in combination with additional safety controls like #1 and #2, or if #2 is not done, canary deployments are done on non-critical systems. CrowdStrike appears to think their Falcon sensor doesn't touch critical systems, or that potentially crashing the systems of some customers is worth the cost savings, as #2 would cost more money.

      It's possible that some companies would volunteer to take on unnecessary risk for their vendor and be the guinea pig for no benefit to themselves, but in the long run, these risk-seeking companies would make themselves extinct. Another possibility is that CrowdStrike would select low-value, low-visibility customers to test on so that customer complaints wouldn't hurt their reputation or too much of their revenue. They did not explain whether their canary targets will be random, opt-in, or strategic.
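
      One common way to pick the cohorts (a sketch, with invented ring percentages) is deterministic hashing, so assignment is stable rather than re-randomised on every release:

      ```python
      import hashlib

      RINGS = [("canary", 1), ("early", 9), ("broad", 90)]  # percent of fleet

      def ring_for(host_id: str) -> str:
          # Hash to a stable 0-99 bucket, then walk the cumulative percentages:
          # a host always lands in the same ring, so rollout order is reproducible.
          bucket = int(hashlib.sha256(host_id.encode()).hexdigest(), 16) % 100
          cumulative = 0
          for name, pct in RINGS:
              cumulative += pct
              if bucket < cumulative:
                  return name
          return RINGS[-1][0]

      for host in ("host-001", "host-042", "critical-db-01"):
          print(host, "->", ring_for(host))
      ```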

      1. HappyDog

        Re: Canary Deployment

        "They did not explain if their canary targets will be random, opt-in, or strategic."

        Probably the "Free" tier.... with ads :(

      2. gnasher729 Silver badge

        Re: Canary Deployment

        "2. End-to-end testing (full testing in a non-production environment)"

        Production or not is only relevant for a product running on a server. Amazon doesn’t want all sales to stop because testing some server update fails. But this software runs on end-user machines. Give every developer a cheap end-user machine.

        1. Anonymous Coward
          Anonymous Coward

          Re: Canary Deployment

          Absolutely force developers to use the same systems that the hoi polloi use. We had an app tested by ITS on 16 GB RAM computers with SSDs; it fell over when run on the employee computers that came standard with 8 GB RAM and ordinary disks. I heard one of the heads of IT claiming that 16 GB and SSDs were standard for the company until someone pointed out that the purchasing form for standard company PCs specified 8 GB and ordinary disks. I had talked to someone in the IT department and asked what they used for testing and was told "standard approved computers"; asked point blank if they had more than 8 GB RAM and used SSDs, which he denied. Some older machines still in use only had 4 GB RAM and smaller/slower drives. Machines just locked up when the app distributed itself: all RAM used and constant paging meant no progress by the app. They kept sending me updates that would fix the problem, ha. Six months later they scrapped the automatic installation of the Microsoft application. It's still available though, and even with more RAM, faster CPUs and SSDs it will crash or lock up your computer.

        2. Anonymous Coward
          Anonymous Coward

          Re: Canary Deployment

          Nowadays, we can have virtual machines (VMs) running Windows, on the cloud or locally, which would function as test/non-production environments. Applying the change to any test Windows VM would have replicated the blue screen of death.

    5. gnasher729 Silver badge

      Re: Canary Deployment

      My company did this, and we used a little country two hours ahead in time. Sorry guys :-(

    6. matjaggard

      Re: Canary Deployment

      Actually CrowdStrike does have exactly this type of system - customers tend to opt for earlier releases using test systems. However, CS allow some updates to skip that process, and this was one of those. Many customers only found out about this skipping when their systems crashed.

  10. JLV Silver badge

    EU blaming...

    The Risky Business podcast had a field day with CrowdStrike. They have a running joke that there is a curse: whenever one of the two goes on vacation, there is a guaranteed mass outage that will clog their stuff-to-cover on return. I had caught them saying one was off for winter vacay right before...

    Anyways, they did cover the EU bit. Seems the EU told MS they had to give a level playing field because they couldn't have their own software (Defender?) running in the kernel while keeping others out. Competition and all that. So "big players" - like CrowdStrike - got the kernel/ring 0 access, after signing the requisite forms swearing over their first born, pledging to commit seppuku and listening to Taylor Swift albums on repeat if they failed.

    RB went on to say that MS has been on and off - somewhat desultorily according to them - working on kernel Windows API frameworks that would allow security software to define rules (like the infamous 291 file) that would then be passed on, from an AV/EDR in userland, to the API for verification. They expect that effort to be gently encouraged along by regulators after this seriously unfortunate incident.

  11. Dizzy Dwarf

    Blameless Post Mortem

    Gotta wonder how that went.

  12. Anonymous Coward
    Anonymous Coward

    I once worked in an org that had the following:

    1. Segregation of Duties. Nobody has access to both produce the code and run it.

    2. All code was peer reviewed

    It was great. Yes it did mean things sometimes looked slower. But as long as processes were documented and understood it was almost seamless.

    I've seen recent workplaces where the 90's look like a period of quiet competency.

    1. JFDI rules

    2. If something isn't documented, make it up on the fly

    3. Everyone has all access even if it's not required

    I'm not surprised that CrowdStrike managed to bork so many servers. I'm more surprised that it took so long. And that anyone was surprised by it.

    1. An_Old_Dog Silver badge

      Whaaaat?

      Nobody has access to both produce the code and run it.

      That makes it tough for programmers to test their own work; such testing is (or should be!) part of the programming process.

      Additional tests to be passed before official release are another matter.

  13. Neoc

    "Our content validator failed." So does this mean that NO-ONE in CrowdStrike uses their own product. Or even does such simple testing as loading it in a test machine and see what happens?

  14. ShortStuff
    Mushroom

    But It Was Only An Accident !?!?

    No, it was a trial run for election day ...

  15. ChoHag Silver badge

    > This serves as a sobering wake-up call for the rest of us in the tech industry.

    Good morning lads! Nice of you to join us!

  16. Bluck Mutter

    What I have never seen mentioned in all the comments is this: if CS have such crappy development/deployment processes, how do you (the end user) know that their software actually works against threats?

    Do they have a test suite that ensures all known/existing threats are still detected for each deployed change, or do they only do a unit test for the latest threat they are adding, assuming that whatever change they are making doesn't cause issues elsewhere?

    That, to me, is the biggest outcome of this: does their software actually work in 100% of cases?

    I worked on a mission-critical software product (not in the threat management space) used by extremely large international companies and government depts worldwide, as a one-man band: design, development, test, deployment, implementation - and no matter what change I made, no matter how small, I did a full system test. The product had 12 modules for different source and target end points, and a full system test for any specific module took a week.

    Yes, unlike CS, I had the luxury of time (i.e. wasn't driven by the need to do rapid deployments against new threats) which means CS should have extremely robust test suites to validate the code works against all known threats. But do they?

    Bluck

    1. Anonymous Coward
      Anonymous Coward

      Unfortunately, the answer to all your questions is no. Metrics come through, but who knows how accurate they are?

  17. HorseflySteve Bronze badge
    Pint

    Windows BSOD due to config file nothing new!

    Way back when I was using a Windows98 laptop, I had it suddenly start BSODing immediately it tried to switch to graphics mode during boot up.

    Safe mode crashed too, so the only option that worked was safe boot to command line.

    Faced with no means of troubleshooting, I was seriously considering nuke & re-install but I was lucky enough to have access to another PC.

    I searched for anyone else having a similar problem & found one report of a corruption in C:\WINDOWS\POWERPNT.INI leading to a similar crash. I didn't have MS Office on my laptop so thought I wouldn't find that file but, sure enough, there it was.

    Opening it with Intel's AEDIT, I found that a # character had somehow been inserted before the initial [ and deleting it, saving the file & rebooting resulted in a normal bootup. I confirmed it by recorrupting the file & the BSOD came back. I also found that Windows98 recreated the correct (working) version of the file if I just deleted it.

    I was never able to find that helpful online comment again so, if the author is reading this, have a very belated but well deserved one of these on me ------------------>

    1. John Smith 19 Gold badge
      WTF?

      I didn't have MS Office..so thought I wouldn't find that file but, sure enough, there it was.

      So your laptop is stuffed by an MS file for an app that wasn't even there?

      Did you inherit this machine from someone else?

      Otherwise this implies an MS policy of loading stuff on the off chance you might be buying their app down the line.

      WTF?

      1. HorseflySteve Bronze badge

        Re: I didn't have MS Office..so thought I wouldn't find that file but, sure enough, there it was.

        No, it was bought brand new. It had never had MS Office installed, not even a trial version. I was using it for my Open University degree, which required (at that time) Microsoft Works, which did not have a presentation function.

        I think I have Windows98 on a VM somewhere. If so, I'll have a look at powerpnt.ini to see what's in it. After all, it may not be related to the PowerPoint app, in spite of its name, particularly as it was regenerated at the next boot after I deleted it.

        1. John Smith 19 Gold badge
          Unhappy

          "may not be related to the PowerPoint app"

          Also possible.

          And highly suspicious if correct.

          Just the strategy a malware slinger would adopt. Giving a malicious file the sort of name you could expect to find on a lot of people's personal machines.

  18. Anonymous Coward
    Anonymous Coward

    On the plus side..

    .. we now have a couple of MacBooks on order to introduce some platform diversity. We didn't use CrowdStrike, but management suddenly woke up to the fact we may not be so lucky next time.

    Silver lining :)

  19. Anonymous Coward
    Terminator

    Microsoft deserves some of the blame :o

    You know, I think with all of Microsoft's developers and lawyers, they could come up with a better, legal way to avoid this kind of foul-up and let software companies compete equally. It's not rocket science.

    The ‘better’ way was to allow third-party AV apps. Better as in they could foist off legal culpability on some third party.

    Microsoft doesn't want any of the blame, but it deserves some of it. For far too long, we've placed too many vital IT eggs in the Windows basket. When that basket falls, so does much of the economy.

    Microsoft deserves all the blame, having designed such a defective OS. CrowdStrike was designed as a device driver so that it could detect malware before the main OS loaded. Which raises the question: what was Secure Boot for?

    --

    insert down votes here:

  20. TeeCee Gold badge
    Facepalm

    I remember it well.

    When Win 7 was in gestation, one of the "big ticket" items was that it would deny kernel mode access to third party software. The big A/V and corp bourgeware providers threatened to sue and MS backed down. I remember thinking at the time that they should have fought it, but they would probably have lost...

    I'm afraid that MS are right about the EU. That bunch of tits are currently trying to force Apple to open their kernel, courtesy of their "digital gateway" arsehattery, when we now have proof that they should be enforcing the exact opposite.

    You want stable operating systems? Keep politicians and lawyers well away from them.

  21. DamienH

    This is literally the biggest Who, Me? in the history of the industry. Should be a good article here on The Reg in about 20 years.

  22. shawn.grinter

    Offshoring

    No comment on the fact CrowdStrike offshored their development to India in early 2024!

    1. ecofeco Silver badge

      Re: Offshoring

      Nor that George Kurtz, then CTO of McAfee during the infamous McAfee debacle exactly like this one, is now the CEO of CrowdStrike and made these staff changes?

  23. anthonyhegedus Silver badge

    Microsoft business as usual

    Microsoft like to have their ducks all lined up in a row, or as I like to call it, "they have the blame path set".

    Why take the blame when it's an update to a piece of software that isn't written by them?

    Why take the blame when it's not their fault if the EU mandated allowing third parties access to the kernel?

    Why take the blame when it's the antivirus companies who begged them to allow kernel access?

    Why take the blame when it's THEIR OS that, once again, has failed to be an OS and actually manage stuff?

    MS are more interested in rearranging features, introducing "new" features that we've actually had since Windows 7, adding advertising and upsell opportunities, charging for features you'd expect to be part of the OS and generally shafting the end user. I've said it before in these pages, and I'll say it again: Microsoft have a track record of making largely shit software. Disagree? Tell me about their really good software that everybody liked.

    1. David Nash

      Re: Microsoft business as usual

      I am interested to know what MS could have done to prevent this. I am far from an MS fan, but they are taking a lot of flak for something they didn't do.

      Running buggy software at kernel level on any OS doesn't mean that OS is at fault.

      1. anthonyhegedus Silver badge

        Re: Microsoft business as usual

        That was my initial reaction too. However, the OS should be able to recover from this. If a kernel-level app BSODs the device, the OS should know to maybe stop it running and flag an exception. Or at the very least have given sysadmins an option.

        Also, why allow third-party ring-0 stuff to run at all? I don't buy the "the EU made us do it" excuse. macOS seems to be fairly secure and they don't allow kernel-level stuff, after all.

        I mean obviously crowdstrike were the root cause of this, but I still think MS has to take some of the blame for having a gung-ho attitude to security.
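
        Conceptually something like this - Windows exposes no such interface, so the logic is simulated in plain Python purely to illustrate the argument:

        ```python
        CRASH_LIMIT = 3
        crash_counts: dict[str, int] = {}
        quarantined: set[str] = set()

        def boot(drivers: list[str], crashes_on: str | None = None) -> bool:
            for drv in drivers:
                if drv in quarantined:
                    print(f"skipping quarantined driver {drv}; flagged for the admin")
                    continue
                if drv == crashes_on:
                    crash_counts[drv] = crash_counts.get(drv, 0) + 1
                    if crash_counts[drv] >= CRASH_LIMIT:
                        quarantined.add(drv)  # stop loading it rather than loop forever
                    return False              # this boot bluescreened
            return True                       # booted cleanly

        # Three crashes in a row, then the OS boots without the bad driver.
        for attempt in range(1, 5):
            ok = boot(["disk.sys", "badsensor.sys"], crashes_on="badsensor.sys")
            print(f"boot attempt {attempt}: {'OK' if ok else 'BSOD'}")
        ```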

  24. Anonymous Coward
    Anonymous Coward

    Credit where it's due: I spent over 16 hours on a single non-stop Teams call with my global peers as we worked to recover about 50% of our global server estate and about 20% of our workstations. It was faultless.

    After the initial "squeaky bum" moment of not being able to authenticate on our Virtual Centre OR access our password vaults, it was quite a heart-warming experience to get the team all pulling together on something that wasn't our fault but that we worked tirelessly to resolve, like many others. From a collaboration and team-bonding perspective, the whole event was really quite exciting to be a part of.

    I eagerly await the inevitable Ron Howard-directed movie starring Ryan Gosling as the plucky sysadmin who has to overcome this crisis single-handedly to save America (f*** YEAH!) from annihilation. Probably call it "Blue Friday" or something....
