How did a CrowdStrike file crash millions of Windows computers? We take a closer look at the code

Last week, at 0409 UTC on July 19, 2024, antivirus maker CrowdStrike released an update to its widely used Falcon platform that caused Microsoft Windows machines around the world to crash. The impact was extensive. Supply chain firm Interos estimates 674,620 direct enterprise customer relationships of CrowdStrike and Microsoft …

  1. 45RPM Silver badge

    How did this happen? It seems to me that it happened because CrowdStrike’s quality engineering and release procedures are nowhere near fit for purpose. But it also happened because Microsoft still only pays lip service to security on Windows. CrowdStrike shouldn’t have released what they did - but equally it shouldn’t have been able to take down Windows.

    1. doublelayer Silver badge

      This article explains, if you didn't already know, why Windows has to go down when code which is running as part of the kernel breaks this badly. Guess what would happen if a kernel module I loaded into Linux, Mac OS, or any other operating system had a memory violation. That's right, it would panic. It is required to panic. If it did not panic, that kernel has a serious reliability problem.

      Until people understand that, the attempts to find a reason why Microsoft is to blame here will not work. Maybe you or someone else can actually find a thing that Microsoft should be doing differently related to this, but while people continue to post comments trying to blame it for doing something both standard and necessary, you will fail to make any case because it appears that you have a gap in important systems knowledge.

      1. Anonymous Coward

        You miss the point. Yes, buggy code that ends up in the kernel will cause crashes.

        The issue here is why a vulnerability tool has to go in the kernel. Something like that should only be running in user space: no ifs or buts.

        Microsoft is to blame for allowing any old shit to go into the kernel. Just as other OS implementers would be to blame if they did that and exposed their customers to Crowdstrike's fuckup. But they didn't. Microsoft did.

        1. doublelayer Silver badge

          Several things in your comment are wrong or misleading:

          "The issue here is why a vulnerability tool has to go in the kernel. Something like that should only be running in user space: no ifs or buts."

          It goes in the kernel so that it has more visibility and control over what happens. There are some things that can't be done from user space at all, for perfectly good security reasons, and others which can't be done efficiently from there.

          Next, the idea that Microsoft is to blame for putting it in. They didn't. CrowdStrike is not a Microsoft product or dependency. People install it. Just as, if I write a kernel module, I don't ask for or get Linus's sign-off before running it. People are able to install things at kernel level, and they make the choice whether to do so or not. It is not Microsoft's decision to permit it, and if it was, we would be rightly complaining about the level of authority they claim to have to make that choice for us. They should not and do not deny people the right to do something potentially damaging with their own computers.

          1. Anonymous Coward

            "It goes in the kernel so that it has more visibility and control over what happens."

            Bollocks! One of the principles of security is least privilege. Which IMO means nothing goes in the kernel unless it absolutely must go there. The kernel is no place for a fancy anti-virus tool.

            "There are some things that can't be done from user space at all, for perfectly good security reasons, and others which can't be done efficiently from there."

            Agreed, Crowdstrike's crap has no good reason to be anywhere near the kernel.

            "Next, the Microsoft is to blame for putting it in. They didn't."

            I never said they did that. They provided the means to let hazardous code get into the kernel. In the same sense as cigarette manufacturers provide the means for people to inhale hazardous chemicals and get lung cancer.

            If Microsoft cared about their reputation - no sniggering at the back! - they would have some way of validating, auditing and testing third-party code that wanted to go in their kernels. Either they didn't have that or their checks went wrong by giving a free pass to Crowdstrike's shit. Which means Microsoft are partially to blame for the recent global meltdown. They're not the only ones that have to take the responsibility for that of course. But they have some culpability.

            1. joeldillon

              ' IMO means nothing goes in the kernel unless it absolutely must go there. The kernel is no place for a fancy anti-virus tool.'

              Andrew Tanenbaum and the other microkernel advocates lost that argument in the 1990s (to Linus Torvalds among others). Some (many) things do go in the kernel which in theory need not do but for efficiency reasons do.

              1. Charlie Clark Silver badge

                No, Intel's x86 architecture imposed a huge penalty on context switching so OSes, starting with Windows NT 3.51 decided to move drivers to the kernel.

                1. JRStern

                  I remember 3.51 and all that jazz but that was a million years ago, hasn't there been huge progress in x86 processor design, around virtualization, that makes it faster now?

                  1. the spectacularly refined chap Silver badge

                    In a word... no.

                    Indeed if anything the situation is getting worse. Ever longer pipelines, even larger caches etc all conspire to increase the penalty of each context switch.

                    1. BossHobo

                      This comment had me engaging with chatgpt to learn a little something. Shouldn't today's multiple cores and threaded environments lessen the impact of such context switching? Could we (rather, OS developers) plausibly move drivers back to user space and would it be worth the added stability?

                    2. Julz

                      Spot On

                      Preemptive execution and huge caches strike again. In some situations you can have a lot of rolling back and invalidating of cache lines to do on a context switch, which has caused all sorts of design decisions to be made, including the one to move away from microkernels which, I feel, is a poor one. When I was doing such things, I measured UltraSPARC CPUs using an average of 4 clock cycles to perform a context switch in and out of kernel space. That figure on modern CPUs and kernels is in the order of hundreds and even thousands of clock cycles, and no, they are not clocked that much faster. The quest for straight-line CPU speed and marketing bragging rights amongst CPU manufacturers has had many consequences for both the security and the real-world speed of systems.

            2. This post has been deleted by its author

            3. Michael Wojcik Silver badge

              Still wrong.

              Anti-malware software has to run with kernel or kernel-equivalent privilege, because it has to hook scores of APIs. Do a little research before you post and you won't sound so ignorant. Read things like Marcus Hutchins' blog, for example, and learn how things actually work.

              If anti-malware software ran merely as a privileged user-space process, then malware running as a privileged user-space process (which happens all the time on all the major OSes,[1] because of the same architectural issues) could bypass it.

              Now, it's conceivable that a microkernel OS designed for security with good privilege separation, like CheriOS (which is built on top of CHERI), might provide a better, less dangerous position for anti-malware software to run in. It's not trivial to figure out exactly where to place it in that architecture, but at least there are more choices. A virtualizing OS like Qubes OS (or even IBM VM and its descendants) might also provide some better options. And, sure, some people will argue that the original NT driver model with the HAL would make this sort of failure less likely. But a complete rearchitecting of Windows would be an infeasibly expensive and risky undertaking at this point.

              And Linux and UNIX are no better; they also have drivers in the kernel, and anti-malware there faces the same problem: run in the kernel and risk panics, or run in user-space and risk being bypassed by other privileged user-space processes. (I understand Crowdstrike on Linux, when run in user-space, just compiles a bunch of code using eBPF, which means it still has code running in kernel mode. And if you think eBPF has never been a problem, you aren't paying attention.)

              The real fix is to move to an actually modern OS like CheriOS. Good luck with that; some people need to run software they don't control or don't have the luxury of rewriting. In the meantime, rather than complaining, consider running Windows in VMs rather than on the bare metal. There are other host OSes, like, oh, Qubes.

              [1] Yes, including e.g. zOS. Hello, APF-privileged loadlibs!

          2. DS999 Silver badge

            Sorry Microsoft shares a lot of the blame here

            It goes in the kernel so that it has more visibility and control over what happens. There are some things that can't be done from user space at all, for perfectly good security reasons, and others which can't be done efficiently from there

            Crowdstrike's Linux software comes in two versions, one that has a kernel module like Windows but also one that runs in user space and uses the eBPF (packet filter) facility, which is the one they encourage people to use. If Windows had a similar user space API, the error would have only brought down the Crowdstrike software, not the entire Windows kernel.

            macOS goes one better - they have managed to move networking out of the kernel entirely and it runs in user space, so similar software (I'm not sure if Crowdstrike is even available for the Mac) would never have needed to run within a kernel context.

            So no, Microsoft did have a choice here. They've supported antivirus applications since the early days of Windows and have had plenty of time to build better ways to interface them with the OS. Instead they've chosen to invest developer time regularly changing the GUI and recently on giving Clippy "AI".

          3. Dan 55 Silver badge

            It goes in the kernel so that it has more visibility and control over what happens. There are some things that can't be done from user space at all, for perfectly good security reasons, and others which can't be done efficiently from there.

            Only because MS never made an endpoint security API to make this stuff accessible from userland. Something like this, which Crowdstrike uses on the Mac version.

            Please let's not defend Windows' architecture as the way things must be when we have two examples in alternative OSes of how it can be done better.

            1. Snake Silver badge

              RE: examples that have done better

              That's a tall claim! The two OS's that are mentioned to have alternate methods of hooking an AV system into, can you confirm that Microsoft's OS structure actually has that ability without breaking decades of compatibility?

              You're talking out of your league and out of your butt. The kernel method of AV integration is how MS's OS's have always worked and we have not been shown that doing otherwise is even an option in their kernel architecture. The OS's mentioned even have user isolation and security that MS doesn't even have, and may not ever be able to have without breaking decades-old software.

              1. Dan 55 Silver badge

                Re: RE: examples that have done better

                AVs and endpoint security software by definition have to be up to date. MS could develop an API and make developers transition over simply by not signing off any more WHQL drivers for this kind of software after a certain date. It wouldn't break any other decades-old software, why would it?

                So the remaining question is whether it is even an option in their kernel architecture. It should be. Can they make the necessary changes to accomplish this? That's up to MS. If they can't, it doesn't bode well for future improvements to Windows that aren't just rejigging the UI.

                If it turns out they can't then they should at least make recovery easier. Return F8 and "last known good configuration" to the boot sequence and make recovery from failed driver loading easier and more automatic.

                1. Tron Silver badge

                  Re: RE: examples that have done better

                  MS should still have had a recovery position, perhaps one that distinguished between stuff in the kernel that was their own/essential, and 3rd party stuff. If you like, a 'safe mode' for kernel crashes. As a previous comment said, they prioritise degrading their OS with worse versions, gimmicks, bloat and restrictions, when we would be happier to have stuck with the same OS for years longer, with better resilience being added.

                  So some fault to MS, but most to CrowdStrike who could not have done anything like enough testing on their code, or they would have known it was cack.

                  1. doublelayer Silver badge

                    Re: RE: examples that have done better

                    You mean that one mentioned in the article. It might have read something like this:

                    "The way that it works is that drivers can set a flag called boot-start," he said.

                    "So normally, if you've got a driver that's acting kind of buggy and causes a failure like this, Windows can auto resume by simply not loading the driver the next time. But if it is set as boot-start, which is supposed to be reserved for critical drivers, like one for your hard drive, Windows will not eliminate that from the startup sequence and will continue to fail over and over and over and over again, which is what we saw with the CrowdStrike failure."

                    So they have that by default, and it would have done exactly what you describe except that a flag was set specifically to bypass that safety feature. As it says, there's a good reason to allow something to set itself that way, in case this is required for the system to boot correctly anyway.
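The behaviour quoted above can be put into a few lines as a toy model. This is only a sketch of the rule as the article describes it, not how Windows actually tracks driver failures; `Driver`, `boot_start`, `crashed_last_boot` and `drivers_to_load` are made-up names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Driver:
    name: str
    boot_start: bool         # hypothetical flag: marked as required for boot
    crashed_last_boot: bool  # hypothetical flag: loading it crashed the previous boot


def drivers_to_load(drivers):
    """Return the names of drivers this toy loader would load on the next boot.

    A driver that crashed last time is skipped, unless it is marked boot-start,
    in which case it is loaded again regardless.
    """
    return [d.name for d in drivers if d.boot_start or not d.crashed_last_boot]
```

With a healthy disk driver, a crashing optional driver and a crashing boot-start driver, only the optional one is dropped. The crashing boot-start driver is loaded again, which is the repeated failure described in the quote.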

                    1. Dan 55 Silver badge

                      Re: RE: examples that have done better

                      Only Crowdstrike isn't necessary for a system to boot properly anyway. That flag shouldn't really be set for it.

                    2. Michael Wojcik Silver badge

                      Re: RE: examples that have done better

                      Kernel-mode anti-malware on Windows typically uses boot-start because otherwise there's a race at startup which malware can exploit. That should be obvious.

                      So it's a tradeoff between risks, as happens all the time in security. You can argue Crowdstrike made the wrong choice — and I'm not interested in defending them; the Friday incident was a monumental screw-up and they don't deserve any defense — but it's neither arbitrary nor patently incorrect. We've seen plenty of Windows malware delivered as signed drivers (because many OEMs and ISVs are not good at key hygiene), so early loading has justification.

                      1. Dan 55 Silver badge

                        Re: RE: examples that have done better

                        It seems from the docs that early loading and marking a driver as required for booting are two separate things and a non-boot driver can be set up to load straight after boot drivers have loaded.

                2. Anonymous Coward

                  Re: RE: examples that have done better

                  Last known good might not work here. Windows wouldn't have known about the update, as it didn't go through the Windows install process. This update was done without Windows knowing about it. You would probably have had to go back to the original CrowdStrike app/driver install.

              2. Someone Else Silver badge

                Re: RE: examples that have done better

                The kernel method of AV integration is how MS's OS's have always worked and we have not been shown that doing otherwise is even an option in their kernel architecture.

                And, of course, that makes it right and forever immutable.

                Bollocks!

                What it makes it is hidebound. If Micros~1 could ever be made to get off their thumbs and behind the concept of refactoring (and innovation, but I digress...), perhaps they would figure out a way to eliminate this problem.

                "Perhaps", because given their recent track record, it's not clear they could write a "Hello, World" program without fucking it (and/or something else) up.

            2. Peter Gathercole Silver badge

              Endpoint Security API

              This MacOS facility looks a lot like the Auditing API on AIX which has been in use for 30+ years on POWER/PowerPC/Power systems.

              The protection it provides depends on the event code that contains instrumentation. I don't know MacOS, but I am somewhat familiar with AIX.

              In that OS, pretty much every system call has a block of code that can be configured to drop useful information into a ring buffer, which can be read using a specific system call to allow out-of-kernel code to process the events for auditing purposes. There are selectors that allow events to be not logged, or acted on, or logged but ignored, depending on the configuration of the subsystem.

              But if MacOS works like AIX, this is not actually enough to act as an AV filter. It does allow you to notice when files are being opened or accessed. It does allow for all sorts of other system events to be recorded (originally, almost all kernel routines were instrumented, but I cannot say that is true for AIX any more). But importantly, for things like file or network read and writes, it does not give you the ability to look inside the data being passed around. It's like it keeps the metadata and a subset of the data itself, but not the whole data, a bit like the phone system that may keep information about who you called and when, but not the actual conversation.

              This makes it useful to record (and in some cases act on) certain events, but does not give it the complete access of an in-kernel driver, which in a traditional monolithic kernel can reach all of the internal buffers and data control structures.
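The audit mechanism described above (instrumented calls, a ring buffer, selectors, metadata but not payload) can be sketched roughly like this. It is an illustration only, not the AIX or macOS interface; the class, method and event names are invented.

```python
from collections import deque


class AuditTrail:
    """Toy audit subsystem: instrumented calls emit events into a bounded ring buffer."""

    def __init__(self, capacity, enabled_events):
        self._ring = deque(maxlen=capacity)  # oldest events are dropped when full
        self._enabled = set(enabled_events)  # selectors: which event types get logged

    def emit(self, event, pid, path, data=b""):
        """Called from an instrumented 'system call'; keeps metadata, not the payload."""
        if event not in self._enabled:
            return
        self._ring.append({"event": event, "pid": pid, "path": path, "size": len(data)})

    def read(self):
        """Drain the buffer, as an out-of-kernel auditing process would."""
        events = list(self._ring)
        self._ring.clear()
        return events
```

The reader sees who touched which file and how many bytes were involved, but never the bytes themselves, which is the phone-records analogy from the comment.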

              Someone mentioned eBPF for Linux and Linux-like OS's. This allows you to hang external code written in a p-code on hooks provided in some kernel routines. This can be done dynamically after the OS has been started, and has been criticised for being able to intercept and in some cases modify in-kernel data using code that is not in the kernel itself. I have always worried about this as a feature, because it sounds to me like a way of altering kernel behaviour with little or no oversight. The original BPF code allowed you to put code into the pseudo machine implementing the filters to trigger certain actions like dropping or logging certain specific (originally network) events, but eBPF allows much more access to the data, and works on more than just network events. It probably does allow the type of actions that CrowdStrike want to do, but the code runs in a kernel pseudo machine that can be allowed to crash without affecting the OS.

          4. andy 103

            CrowdStrike is not a Microsoft product or dependency. People install it.

            The problem being that people install it at airports, banks, and on other critical infrastructure. Then can't cope when it all falls down.

            All from organisations with reams of policies about that not happening, yet allowing it to happen so easily.

          5. Randy Hudson

            > It goes in the kernel so that it has more visibility and control over what happens

            Bullshit. Whatever visibility and control it needs can be exposed using APIs that call out to a user space process. This is exactly how crowdstrike for macOS works.

            1. doublelayer Silver badge

              Yes, they could have implemented a two-stage process where they still have a kernel-level program and it provides data out to something else. There might have been an efficiency drop by doing that, but it would probably be fine enough. The critical point, however, is that this change, while it might have prevented this problem, still involves there being code running at kernel level which, if it broke, would break the kernel. The attempts to blame Microsoft often take the form of explaining that CrowdStrike shouldn't have run anything at kernel level at all, which would not work, and then finding a reason why it's Microsoft's fault that they could, which it isn't.

        2. TheMeerkat Silver badge

          For those blaming Microsoft for CrowdStrike having to go into the kernel - you clearly don’t understand how security tools work. Yes, they have to go into the kernel to be effective.

          1. Adair Silver badge

            Which even if true at an absolute level (moot), doesn't absolve MS from its responsibility to provide an OS that is capable of effectively mitigating such disasters, e.g. rollbacks, immutability, etc.

            We're still expected to put up with what amounts to a 'toy OS' for use in frontline services as though that is acceptable and normal.

            Smells more like 'profit before people/security'.

            1. Anonymous Coward

              If CrowdStrike hadn't set boot start we wouldn't even be having a discussion.

              1. I ain't Spartacus Gold badge

                If CrowdStrike hadn't set boot start we wouldn't even be having a discussion.

                Because the best thing for computer security is for anti-malware software to silently be disabled on reboot so malicious software can more easily take over our computers.

                1. Anonymous Coward

                  Original AC here:

                  Obviously not, but there is definitely a problem when the only choices are "crash and stop the OS" or "start the OS with no malware protection".

                2. Kiers

                  LOL. It's like after 9/11....we should dismantle our constitution and sense of panic at panopticon surveillance for our security.

              2. Dan 55 Silver badge

                If CrowdStrike hadn't set boot start we wouldn't even be having a discussion.

                Or if Microsoft hadn't signed off on that driver due to that flag. They could always have said no.

                1. Steve Channell

                  This article says they couldn't ("https://www.theregister.com/2024/07/22/windows_crowdstrike_kernel_eu/")

                  There is simply NO EXCUSE for what CrowdStrike has done: they've taken advantage of a signed kernel driver to side-load code into the kernel in contravention of the license agreement. We can expect a future Patch Tuesday to black-list the csagent as malware.

                  To describe these sys files as "broken configuration file" insults the intelligence of readers: either they include binary executable code, or they are a virtual environment like https://en.wikipedia.org/wiki/EBPF but inferior.

                  1. Dan 55 Silver badge

                    The EU only said that whatever MS did for their own endpoint software, they should offer third parties the same access that they have themselves.

                    MS could have developed a security endpoint API for themselves and third parties but instead they decided to allow kernel access for themselves and 3rd parties via kernel drivers.

                    That doesn't excuse Crowdstrike's shonky code of course.

                  2. Michael Wojcik Silver badge

                    To describe these sys files as "broken configuration file" insults the intelligence of readers : either they include binary executable code, or are virtual environment like https://en.wikipedia.org/wiki/EBPF but inferior.

                    Eh? That's prima facie incorrect. It is certainly possible for invalid data to trigger a logic bug in the code that interprets it. Indeed, that happens all the time. Have you never heard of fuzz testing?

                    A trivial example: Read an integer type field from a record, use it (without validation) as an index into a table of addresses, attempt to dereference the retrieved address (perhaps plus some offset). If the record has an invalid type value, you can get precisely the symptoms seen in this case.
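A minimal sketch of that trivial example, in Python purely for readability. The handler table and function names are hypothetical and have nothing to do with the real driver, where the equivalent mistake is an out-of-bounds read followed by dereferencing whatever was found there.

```python
# Hypothetical handler table, indexed by the "type" field read from a data file.
HANDLERS = [
    lambda payload: f"file-rule:{payload}",
    lambda payload: f"pipe-rule:{payload}",
    lambda payload: f"net-rule:{payload}",
]


def dispatch_unchecked(record_type, payload):
    """Trusts the type field from the file; an out-of-range value raises IndexError.

    Python raises an exception here. Native code would instead read past the
    end of the table and jump through a garbage address.
    """
    return HANDLERS[record_type](payload)


def dispatch_checked(record_type, payload):
    """Validates the type field first and rejects the record instead of crashing."""
    if not isinstance(record_type, int) or not 0 <= record_type < len(HANDLERS):
        return None  # malformed record: skip it
    return HANDLERS[record_type](payload)
```

Both functions behave identically on well-formed records, so testing with valid data alone never reveals the difference. Only an invalid type value, the kind a fuzzer would generate, separates them.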

                    1. Steve Channell

                      Nope, did you not see the reference to eBPF?

                      eBPF includes a verifier where shoddy software does not.

                      As someone who has written kernel-level code, I can assure you that it is possible to write formally provable code using Formal Methods - I don't anymore because the levels of code review, profiling, verification, {unit, system, integration, regression, performance} testing are prohibitive. Duff data is only an issue if you don't verify it.

                      I know my surveillance code was 100% reliable, not because I'm some kind of genius, or used VDM mathematical proof; it was 100% reliable because it checked every pointer and fell back to a read-discard loop (after WTO instruction) that ensured it did no harm. CrowdStrike's code is surveillance code - their first work-around "fix" was to delete csagent.sys

                      to translate your comment " It is certainly possible for shoddy invalid data to trigger a shoddy logic bug in the shoddy code that interprets it. Indeed, that happens all the time in application code"

                  3. diodesign (Written by Reg staff) Silver badge

                    'broken configuration file'

                    At the time config file was the best description we had. This is an evolving saga. Our latest article (linked) gets closer to the specifics, that the channel files customize how templates of code run to detect particular malicious activity.

                    The file in this case was poorly formed and caused its interpreter within Falcon to crash. This was missed in the automated testing.

                    C.

                    1. Julz

                      Re: 'broken configuration file'

                      Turing might like to have a posthumous word about the impossibility of determining the difference between code and data...

          2. Dan 55 Silver badge

            Yes, they have to go to the kernel to be effective.

            They don't have to on macOS.

            1. 45RPM Silver badge

              Ssh! Don’t mention the Mac word. It doesn’t count until Microsoft rips the technology off - and then the argument will be made that Apple also ripped the technology off because… something.

          3. fg_swe Silver badge

            LSM Linux Security Modules

            LSM also uses kernel level code to control and intercept potentially ALL userspace-to-kernel calls.

            Yes, some security things must be done in kernel mode. But with that comes an extreme duty of diligence on the part of the "plugin" author. A config file error must never generate a bad pointer; the code should simply ignore said config file.
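One way to read "simply ignore said config file" is to validate the whole file before activating any of it, and to keep the previous rules if anything is off. A rough sketch of that follows. The JSON format, the `id` and `pattern` fields and the `load_rules` name are assumptions made for illustration; the real channel files use CrowdStrike's own format.

```python
import json


def load_rules(raw, current_rules):
    """Return the new rules if `raw` parses and validates, else keep `current_rules`."""
    try:
        parsed = json.loads(raw)
    except (ValueError, TypeError):
        return current_rules  # not parseable at all: ignore the file
    if not isinstance(parsed, list) or not parsed:
        return current_rules  # wrong top-level shape, or an empty rule set
    for rule in parsed:
        if not (isinstance(rule, dict)
                and isinstance(rule.get("id"), int)
                and isinstance(rule.get("pattern"), str)):
            return current_rules  # one bad record rejects the whole file
    return parsed
```

Rejecting the whole file on the first bad record is a deliberate choice in this sketch: a half-applied rule set is harder to reason about than the previous known-good one.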

            CrowdStrike has very bad quality assurance in place. The government should fine them for negligence.

        3. This post has been deleted by its author

        4. gnasher729 Silver badge

          If this happened to Apple: 1. Apple would tell crowdstrike “the way you f***** up, we won’t let any crowdstrike code run in the kernel anymore. If that’s a problem, figure it out”. 2. Crowdstrike sells itself to tencent which runs to the EU which fines Apple for being “anticompetitive” and hurting consumers.

      2. Yorick Hunt Silver badge

        Up to and including XP, you'd be offered "Start Windows with last known good configuration" in the event of a boot sequence failure. In *nix land, Grub keeps copies of old kernels and boot scripts and allows you to similarly revert to something which actually works.

        Newer versions of Windows though, following Microsoft's decree of "users are too stupid to manage things," simply steamroll through, no matter what, at full speed towards the introduced brick wall.

        Given the number of snapshots being taken left, right and centre, surely it shouldn't be that difficult to implement a failsafe boot process? Or is this going to be their major selling point for "Recall?"

        1. Anonymous Coward

          As far as I recall, all last known good configuration does is use the alternate HKLM control set (there are two that are switched at each boot) so it is unlikely to have helped in this case.

          In fact, I cannot recall a single situation where using last known good helped solve issues I was facing.

      3. 45RPM Silver badge

        The 1980s called asking for you specifically. They want their developer back.

      4. Management Order

        CS is not a kext on MacOS

        Interestingly, Crowdstrike Falcon is not a kext on MacOS, so can't bring the kernel down. Also, Crowdstrike Falcon did bring down Linux kernels back in May, so it's not like they didn't know they have a problem with this.

      5. Randy Hudson

        Crowdstrike is running on my macbook, providing the exact same protection it does on windows, but without running as a kernel extension. You might want to familiarize yourself with how that's possible.

    2. david 12 Silver badge

      RedHat had exactly the same problem with CrowdStrike causing a kernel panic after loading a bad channel file.

      If you want a resilient system that can't be taken down by drivers that are marked as required for boot, then you want a different kind of machine architecture, not a different OS.

      1. Andrew Hodgkinson

        RedHat is buggy; that's not a "new architecture" issue

        Allowing kernel drivers to fail gracefully is a long-solved problem, but quality engineering is expensive and mainstream vendors are cheap a**holes only interested in shareholder gains. As for this specific RedHat crash - please read:

        https://news.ycombinator.com/item?id=41030352

      2. Dan 55 Silver badge

        Apparently a bug in RHEL's eBPF implementation?

      3. thosrtanner

        Sadly, initial Intel machines had 4 rings, but nobody used them and they've apparently been dropped. Maybe we should all go to older hardware - we'd get rid of all the problems with speculative execution side channel attacks then as well.

        1. Anonymous Coward

          Think OS/2 and NetWare did.

        2. Julz

          Pa

          ICL's Estriel CPUs had 7 :)

  2. W.S.Gosset Silver badge

    Canary releases?

    Is that the official new name? I just used to call them phased releases, back in the day. Or just common sense.

    1. Doctor Syntax Silver badge

      Re: Canary releases?

      The term's been around for a while, if not in relation to releases, in other contexts. About 10 years ago it became a practice to post assurances that a business had not been served a subpoena by a given date. Failure to update it was an indication that it had received one without breaching any terms the subpoena may have contained forbidding an announcement that it had. The origin, of course, is a comparison with the coal-miner's canary which would be more susceptible to carbon monoxide poisoning than the miner - not a close analogy with the warrant canary but it fits well with a sacrificial S/W instance which can be exposed to a pending update.

      1. Malcolm Weir

        Re: Canary releases?

        Where this works (and it doesn't always) is in legal jurisdictions where compelled speech is forbidden or extremely disfavored (e.g. the USA). A court can order someone to NOT say something until the matter has been fully adjudicated, but requiring a statement usually is impossible until after the court has heard, and judged, both sides.

        Of course, a government can always request speech ("I counted them all out..."), but the decision to comply has to be voluntary otherwise the speech is compelled.

    2. david 12 Silver badge

      Re: Canary releases?

      The canary releases were one part of the phased release system. Canary was the name used for that phase, somebody thought it was more informative than just "phase 3" or whatever.

    3. that one in the corner Silver badge

      Re: Canary releases?

      > I just used to call them phased releases, back in the day.

      But were you dealing with code that could totally knacker the machine?

      The canary falling off its perch is a good analogy for a BSOD (shortly followed by an explosion - of expletives heard all around the open plan office).

      But if your app failing just meant it had to be restarted whilst the rest of the User's tasks progressed as normal - well, "signal the alarm, the canary has a bit of an itchy wing" doesn't have quite the same ring to it.

      1. Anonymous Coward

        Re: Canary releases?

        "But were you dealing with code that could totally knacker the machine?"

        Anyone who rolls out Windows security updates is currently having a panic attack ;-)

      2. david 12 Silver badge

        Re: Canary releases?

        But if your app failing just meant it had to be restarted

        MS canary releases weren't just boot drivers. The word (and process) was used for other things as well (browser, tools, etc.)

      3. gnasher729 Silver badge

        Re: Canary releases?

        Staged releases don’t protect machines that get the update. But they protect the 99% that were supposed to get the update later. 80,000 crashed computers in the UK would have been a lot better than 8 million.

    4. ghp

      Re: Canary releases?

      "Chi va piano va sano e va lontano", indeed. Whatever you call them, their customers want to be protected ASAP. But of course, quis custodiet ipsos custodes?

    5. veti Silver badge

      Re: Canary releases?

      Phased releases are not common practice in the security industry, specifically, perhaps because it's considered impolitic to give attackers a detailed schedule of changes to the defence before they're made.

      1. thosrtanner

        Re: Canary releases?

        Err. How does doing canary releases suddenly give attackers a detailed schedule of your rollouts? Especially if you keep moving the canaries around?

      2. Dan 55 Silver badge

        Re: Canary releases?

        Or the security industry is too used to moving fast and breaking things?

        With a product such as Crowdstrike, only used by enterprises, the chances of giving attackers info is slim.

        1. fg_swe Silver badge

          FALSE

          CS is used in thousands of enterprises. One of them will be penetrated by criminals or other enemy actors.

      3. fg_swe Silver badge

        Bingo

        Cybernetic attackers will analyze patches in order to attack not yet updated systems.

        Patches should be thoroughly tested by the authoring company.

        Also, they should be distributed/staged encrypted on all affected computers, and only after that should the key be broadcast and the patch actually applied.

  3. Anonymous Coward

    So why was table lookup done in pspSystemThread?

    As someone who has looked at far too many driver crash dump screens since the mid-1980s (usually from my own code), the first thing that jumped out was: why the hell was this being done in the main thread, in a Kernel Space thread called directly by the OS, and not in a worker / auxiliary thread?

    Any code called from an OS kernel dispatch / callback should be just doing very light-weight stuff. Unless it is an actual hardware device driver. And even then you try to keep those main-thread call trees as lightweight as possible. All heavy lift code should be on separate threads. Even better - processes. Either in Kernel Space, or much better, User Space. User Space for agents. Always. So that when stuff goes wrong (which it always will) you have at least some chance of handling it gracefully. Without bringing down the whole damn OS. This is the NT kernel after all. Where since 4.0 drivers are now in Ring 0. So no safety net.

    At least running agent code in User Space will give you some form of Structured Exception Handling support. Which stops BSODs from happening. Mostly. In Kernel Space there are fewer options but exceptions like page faults can be handled. And have been for many decades. If you know what you are doing.

    Then there is the fact that all Kernel Space user application code should be written with asserts everywhere. And I mean everywhere. Every second or third line should be an assert. With the assert code supporting graceful error / failure handling and recovery. And not only checking the legal range of everything going into calls but the legal range of everything coming out. Just like properly engineered code. In embedded and RTOS software.

    We can blame MS for putting drivers and related code in Ring 0. Since NT 4. And we should. But you cannot blame MS for CrowdStrike's utter technical incompetence. Which is the case here. Just look at the csagent call stack code symbols and offsets in the PAGE FAULT crashes. A single-thread call tree with > 3 MB code offset values. Anything over a few hundred KB of offset in code like this in the main thread is starting to push your luck. CrowdStrike must have just stuck all the code in a single thread with zero isolation and partitioning. Which MS (and other OS vendors) have been telling device driver writers not to do for the last 30-plus years.

    Maybe CrowdStrike should hire some people who have actually read carefully (and understood) the DDK docs. And who know how WDM drivers / kernel code etc. work on the bare iron. It's not like the relevant kernel source code has not been out in the wild for the last 20 years. If you want to know what is really going on. Because it's pretty obvious that no one currently working for CrowdStrike has a clue about any of this stuff.
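
    The "check everything going in and everything coming out" idea can be illustrated with a small user-space sketch. It is Python rather than kernel code, and `checked_lookup` with its `lo` and `hi` bounds is a made-up helper for illustration, not anything from CrowdStrike's driver. It shows a table lookup that reports failure to the caller instead of reading out of range:

```python
def checked_lookup(table, index, lo, hi):
    """Return table[index], or None if the index or the result is out of range.

    Hypothetical helper: returning None stands in for whatever graceful
    failure path the caller has (skip the rule, log, carry on).
    """
    # Precondition: the index must be an integer that falls inside the table.
    if not isinstance(index, int) or not 0 <= index < len(table):
        return None
    value = table[index]
    # Postcondition: the value read must itself lie in the legal range [lo, hi].
    if not lo <= value <= hi:
        return None
    return value
```

    A caller that gets `None` back can skip the offending entry, which is the kind of recovery path the asserts described above are meant to provide.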

    1. elDog

      Re: So why was table lookup done in pspSystemThread?

      Guessing out of hubris and laziness.

      Many a neophyte programmer thought it would be much easier to write code to stay in one protection level (kernel) than go through the hoops of having another process handle the real work.

      Remember when everything we wrote was at level-0 on the early micro-PCs?

      I grew up with IBM-360s and learned to love the GE-600 series master/slave levels. Then they built the Multics machines from whence (somewhat) Unix was spawned. I love having the hardware tell me that I f'd up without having to read through a full kernel dump.

      1. Anonymous Coward

        Re: So why was table lookup done in pspSystemThread?... it was laziness

        Have seen this scenario play out a few times over the years. I bet if you look at the source code for the various components you will find that CrowdStrike just took the MS DDK and SDK docs sample code for drivers and callbacks and just pasted in their own code as one big blob. Copy / Paste / Maybe Test it a bit / Ship It.

        Then sooner or later it all blows up and they learn the hard way (if they don't go out of business first) that you cannot just paste code in wherever you think it might fit. The codebase will have to be properly architected into very strongly horizontal and vertical partitioned functional blocks. With very robust error detection and recovery. This is not Win32 application code where you can get away with sloppy code and any old crap. This will have to be embedded software quality code. Which is a whole different ballgame.

        So you actually have to hire people who know what they are doing. Not dot com hires fresh out of college with fancy PhD's. Or outsourced to foreign body shops.

      2. david 12 Silver badge

        Re: So why was table lookup done in pspSystemThread?

        Probably because this was the load phase. The main thread loads the template processes and then sleeps.

    2. Bebu Silver badge
      Windows

      Re: So why was table lookup done in pspSystemThread?

      Thank you.

      Answered a lot of questions for me (not being a windows person ;)

      Having observed crowdstrike software under linux I was quite sure MS wasn't the main culprit in this fiasco (for a change.)

      Crowdstrike appear to be claiming the offending .sys file(s) were data and not executables which is a little disingenuous. If the data didn't alter the execution of their code then why load it? Possibly more accurate to think of their kernel module as an interpreter running in a kernel context, with these .sys files as its code. A bit like a third rate version of eBPF I imagine.

      1. Michael Wojcik Silver badge

        Re: So why was table lookup done in pspSystemThread?

        Crowdstrike appear to be claiming the offending .sys file(s) were data and not executables which is a little disingenuous. If the data didn't alter the execution of their code then why load it?

        Data does alter the execution of code, in the general case. That's the whole point of data. Perhaps you need to review the theory of computation.

        Honestly, I cannot understand why some people are making this argument. To the extent that there's any distinction between code and data, this is precisely that distinction. TM is in state A, reads a datum, moves to state B according to the transition for that datum.

    3. thosrtanner

      Re: So why was table lookup done in pspSystemThread?

      Totally agree with most of this, especially the insanity of only using 2 rings - kernel and user. Well, that's not entirely fair, but what is missing is proper access privileges: for example, device driver threads would have privileges to write to *their* device pages and to read/write user memory WHERE THE USER has given permission (by making a system call asking for memory to be transferred to/from the device). Antivirus software afaics needs even less privilege than that, because, honestly, if your a/v stuff crashes - you need to know, sure, but you can carry on using your system (though disconnecting from the internet would seem a good idea).

      But CrowdStrike's code passed WHQL validation. And that is Microsoft's fault. Device drivers that read files off disk and do things with them are not a great idea.

    4. Julz

      Re: So why was table lookup done in pspSystemThread?

      Exactly this. Sorry I can only upvote you once.

  4. TReko Silver badge
    Facepalm

    You get what you pay for

    In February 2024 Crowdstrike had layoffs in the USA and moved most tech jobs to India. They proudly announced this via a press release.

    It's the same as Boeing's 737 MAX MCAS software being outsourced to $9 an hour jobs in India.

    Unless very carefully managed, the savings are an illusion.

    1. ecofeco Silver badge

      Re: You get what you pay for

      You mean that George Kurtz who was the CTO of McAfee when it did the same thing in the past, was just being George Kurtz, now that he is CEO of Crowdstrike?

      Shocking.

      1. Anonymous Coward

        Re: You get what you pay for...Kurtz has form when it comes to bricking PC's

        The Great McAfee XP Bricking Fiasco in 2010

        https://www.theregister.com/2010/04/21/mcafee_false_positive/

    2. deadlockvictim

      Re: You get what you pay for

      I wouldn't diss the Indians.

      My experience of them is that they are great programmers, at least the ones I have worked with, who are based in Chennai.

      The problem with outsourcing is more a socio-economic one for the country that has lost the jobs than one of the quality of the work done.

      1. Casca Silver badge

        Re: You get what you pay for

        They are good at following instructions to the letter. But not at understanding them. That's why an SAP implementation had 58 companies named X.

      2. theOtherJT Silver badge

        Re: You get what you pay for

        I'm sure that there are a great many truly excellent programmers in India, they're just not working for the $9 an hour outsourcing outfits.

        We've got several Indian programmers, some of whom even work from India, at my place of work and as far as I know they're all very good at their jobs. The kicker is that they work for us not some outsourcing agency we hired in to do the job on the cheap.

      3. gnasher729 Silver badge

        Re: You get what you pay for

        Quote: “We can’t let our developers take a C++ course because with that course they’d leave us and start elsewhere”. (When my boss asked why outsourced developers couldn’t do C++).

    3. Michael Wojcik Silver badge

      Re: You get what you pay for

      In February 2024 Crowdstrike had layoffs in the USA and moved most tech jobs to India.

      Yes, but currently there's no public evidence this had anything to do with this disaster. The faulty logic may well have been in Crowdstrike's driver before February, and the broken process that allowed the release of the corrupt channel file may well have been in place as well.

      There's no shortage of idiots and lazy people in the US. Blaming this on Indians, or on new hires, or whatever, is simple prejudice. It does suggest that Crowdstrike have priorities other than quality (you don't lay off a bunch of developers if you care about quality; that's how you lose institutional memory, among other things), but it is not evidence that the layoffs were in any way the cause of this incident.

  5. that one in the corner Silver badge

    Gizza job

    This whole thing is just Kurtz's SOP for getting his name into the press before leaving for a new job, just like when he left McAfee.

    We all curse the very soil he walks upon, but for the money men in smoke-filled rooms: "Kurtz? Kurtz? I've heard that name somewhere, haven't I? Well, guess that means he's famous. Go ahead, lob some money at him, let's see what he can do. Got any more brandy?"

    1. david 12 Silver badge

      Re: Gizza job

      This whole thing is just Kurtz's SOP for getting his name into the press

      That's what they want you to believe.

      The whole thing is a marketing exercise to demonstrate the size and importance of the CrowdStrike customer base, and the critical importance of their product.

  6. Dagg Silver badge

    Why the hell didn't they (as in crowdstrike) actually validate the contents of the config file before actually using it?!

    In the past for me this was standard procedure for mission critical enterprise level processes.
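
    As a rough illustration of that kind of pre-use validation, here is a minimal Python sketch. The file layout is entirely invented for the example (a `CHNL` magic number, an entry count, then a table of offsets) and is not CrowdStrike's real channel file format. The point is only that every length and offset is checked against the file size before anything is dereferenced:

```python
import struct

MAGIC = b"CHNL"                 # hypothetical magic number, for illustration only
HEADER = struct.Struct("<4sI")  # magic, number of entries
ENTRY = struct.Struct("<I")     # each entry is an offset into the file

def validate(blob: bytes) -> list:
    """Return a list of problems found; an empty list means the file looks sane."""
    if len(blob) < HEADER.size:
        return ["file shorter than header"]
    errors = []
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        errors.append("bad magic")
    table_end = HEADER.size + count * ENTRY.size
    if table_end > len(blob):
        errors.append("entry table runs past end of file")
        return errors
    for i in range(count):
        (offset,) = ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
        # Every offset must point at payload bytes that actually exist.
        if not table_end <= offset < len(blob):
            errors.append(f"entry {i}: offset {offset} out of range")
    return errors
```

    Under this made-up layout, a file consisting only of zero bytes is rejected at the magic-number check, and a loader would refuse to use any file for which `validate` returns a non-empty list.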

    1. Casca Silver badge

      Ah, but that takes time to program. And time is money. Hope they burn in hell.

    2. gnasher729 Silver badge

      It’s possible that the config file was perfectly valid, but a latent bug in the reading code couldn’t handle that specific file. Like a GIF parser that might crash with some perfectly fine GIF file as input.

      1. Michael Wojcik Silver badge

        Yes, but again, testing would have revealed that. As would an initial internal rollout before pushing it to customers, or a phased rollout, or any of the other things people have been suggesting.

        Crowdstrike obviously screwed up badly, because they have bad practices. There's simply no other explanation, other than malice, which they've already disclaimed.

  7. Anonymous Coward

    Why was this flagged as boot-start

    Surely only ESSENTIAL hardware drivers need that. Basically, ones that are NEEDED for the system to limp up, show an error message and do enough to let the user fix things. Most hardware will run in legacy modes too, so no high performance gaming video drivers, just the basic VGA one would suffice.

    1. gnasher729 Silver badge

      Re: Why was this flagged as boot-start

      Obviously if your hard disk driver crashes, it’s game over. All you can do is try again and pray.

      A simple technique for the problem here: if you have more than one configuration file, you write “parsing xyz” to the drive when you start parsing it, and remove the message after success. You also check for this message. If you find it before parsing xyz, you write “parsing xyz for the second time”. If you find that, you skip that file with a big error message.

      That way you skip only one configuration file and only if it is a real problem.
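
      The marker technique described above can be sketched in a few lines of Python. This is a user-space illustration under stated assumptions: `load_configs`, the `.parsing` marker suffix, and the `parse` callback are all hypothetical names, and an uncaught exception stands in for a crash that reboots the machine.

```python
import os

def load_configs(paths, state_dir, parse):
    """Parse each file, skipping any file whose last two attempts never finished."""
    loaded, skipped = [], []
    for path in paths:
        marker = os.path.join(state_dir, os.path.basename(path) + ".parsing")
        attempts = 0
        if os.path.exists(marker):
            # A leftover marker means an earlier parse of this file never completed.
            with open(marker) as f:
                attempts = int(f.read().strip() or 0)
        if attempts >= 2:
            skipped.append(path)  # failed twice already: skip it and report loudly
            continue
        with open(marker, "w") as f:
            f.write(str(attempts + 1))
            f.flush()
            os.fsync(f.fileno())  # make sure the marker survives a crash
        loaded.append(parse(path))  # a crash here leaves the marker behind
        os.remove(marker)           # success: clear the marker
    return loaded, skipped
```

      With this sketch a bad file gets two chances to crash the loader; on the third start it is skipped while the other configuration files still load.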

    2. Anonymous Coward

      Re: Why was this flagged as boot-start

      If a bad guy could get rid of the security software by simply finding a way to crash it so the system reboots without it, that's a massive weak point in the defenses.

      1. Herring` Silver badge

        Re: Why was this flagged as boot-start

        Well there clearly is a simple (and now documented) way to crash it

        1. Michael Wojcik Silver badge

          Re: Why was this flagged as boot-start

          But not to run malware afterward, because that crash prevents booting.

  8. ColinPa

    Downloading stuff automatically

    I believe that one reason why the problem was pervasive is that the code downloads stuff from its server, outside of any change control/freeze/lock down. The same way that antivirus code downloads its stuff. This seems to be a fairly general pattern, which saves the user having to make a decision. For the unattended machines, there is no one to ask, so it is configured to do it automatically.

    Trying to manage this model is a nightmare.

    1) Change the firewall to disallow access to the server

    2) On one day a week, open the path to the server for some of the machines, these machines get updated. Close it again

    You cannot have N-1 and N-2 images.

    3) Next week, open the path to some other machines, these get updated.

    4) etc.

    After some weeks all machines should have been updated.

    Now manage this with multiple servers downloading fixes automatically.

    Of course companies which have processes to distribute fixes and not allow automatic updates, and only allow access to approved sites should be better protected. But this is a lot of work.

    One quote I read ... It is expensive to do it right. It is even more expensive to do it wrong.

  9. andy 103
    Joke

    fail over and over and over and over again

    Windows will not eliminate that from the startup sequence and will continue to fail over and over and over and over again

    In jest of course. A quite rare example where Windows/MS aren't actually at fault for a change.

  10. tiago.pelicari

    Given this level of access, Microsoft itself should take responsibility for QA.

    1. Anonymous Coward

      They do. They issue WHQL signature to say the driver is authorised.

      But since it's a slow process and CrowdStrike want their updates to be faster to stop zero-day exploits, CrowdStrike get round this by updating config files and not the main driver.

      1. fg_swe Silver badge

        "config files"

        It transpires these are in fact similar to Java bytecode or UCSD p code. This bytecode is then interpreted inside kernel mode. Evading MSFT quality control processes.

        The interpreter will crash on bad bytecode, as we have witnessed.

        What a polish turd.

        1. Herring` Silver badge

          Re: "config files"

          I have also heard this theory. If true, then all the bad guys out there now know how to get evil shit executed in ring 0

        2. TimMaher Silver badge
          Coat

          Re: "polish turd”

          What have the Poles done now @fg?

          Anyway, it seems that CS have put a cherry on top.

          1. fg_swe Silver badge

            POLISHED

            ...turd.

            Errata.

  11. Sceptic Tank Silver badge
    FAIL

    Address 0x000000000000009c

    My memory is severely rusty on this topic and I don't have time / inclination to go study it up, but that address sounds suspiciously like it could be a "page zero" access, the unmapped region that x86 / x64 systems use to catch null pointer dereferences. The Falcon people (haha "Falcon Heavy Rocket") do waffle about null bytes or something in the channel file.

    1. Anonymous Coward

      Re: Address 0x000000000000009c

      That was only one of the crashes. Other crashes had other addresses which were not in page zero. So we think it was an uninitialized pointer, not a null pointer.

      But some right-wingers on Twitter said it was a null pointer caused by an incompetent disabled, Black or woman employee. What's wrong with some people :-(

      1. Anonymous Coward

        Re: Address 0x000000000000009c

        But some right-wingers on Twitter said it was a null pointer caused by an incompetent disabled, Black or woman employee. What's wrong with some people :-(

        I do not know, but apparently they post a lot on social media about how the fact that there are gay people, black people - and even female cartoon characters wearing trousers - means that there's some deep conspiracy to ... I don't know, make people polite or something.

    2. LessWileyCoyote

      Re: Address 0x000000000000009c

      Back in the days when I was programming on mainframes, an address with that many digits was an absolute address, i.e. relative to the total memory space of the machine. Address 9C would have been firmly in what we called "the bottom left-hand corner of the machine", where things like the system clock resided. I have no idea whether that concept translates in any way to the PC world, but I do know that if any process on a mainframe had high enough privileges to access that area, but was not the actual OS, everything stopped. Very quickly.

  12. Andrew Mayo

    If only there was a way to check if a memory access would fail

    Oh wait.... https://kernelmode.info/forum/viewtopic0aa3.html?t=5317

    There's an API for this - MmCopyMemory() - introduced way back in Win 8.1 - that allows driver code to verify that a memory address is valid without triggering a kernel trap.

    So any good software engineer would rigorously validate the information in the channel file because - even if these files are signed by Crowdstrike (and I don't know if they *are* signed) - you can't be sure that some malicious actor didn't perhaps manage to perform some kind of supply-chain attack and leveraged a channel file to cause disruption. Obviously in this case it was human error but if the driver code had been written resiliently, this would never have happened.

  13. Seajay
    Facepalm

    Why did they go EVERYTHING AT ONCE?

    Surely a better system for releasing such updates would be a phased release system? You then combine that with a "phone home" to confirm update successful.

    i.e. PC requests update, installs it, then confirms back that it's installed and everything is operating normally. Release system monitors the releases going out, and expected confirmations allowing it to quickly shut down a release if there is something amiss...?
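
    A minimal Python sketch of that loop, under stated assumptions: `staged_rollout`, the stage fractions, the 5% failure threshold, and the `install` callback (which pushes the update and returns True only if the host later phones home healthy) are all invented for illustration, not a description of any vendor's release system.

```python
def staged_rollout(hosts, install, stages=(0.01, 0.10, 1.0), max_failure_rate=0.05):
    """Update hosts in growing waves; halt when a wave's failure rate is too high."""
    healthy, failed = [], []
    done = 0  # number of hosts already attempted
    for fraction in stages:
        target = max(done, min(len(hosts), max(1, round(len(hosts) * fraction))))
        wave = hosts[done:target]
        results = [(host, install(host)) for host in wave]
        healthy += [host for host, ok in results if ok]
        wave_failed = [host for host, ok in results if not ok]
        failed += wave_failed
        done = target
        if wave and len(wave_failed) / len(wave) > max_failure_rate:
            # Too many hosts in this wave never confirmed: stop the release.
            return {"halted": True, "healthy": healthy, "failed": failed,
                    "untouched": hosts[done:]}
    return {"halted": False, "healthy": healthy, "failed": failed, "untouched": []}
```

    With 1,000 hosts and an update that fails everywhere, the first wave of 10 machines fails, the rollout halts, and the remaining 990 are never touched.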

    1. Brewster's Angle Grinder Silver badge

      Re: Why did they go EVERYTHING AT ONCE?

      And even doing a crappy Android app release, we always avail ourselves of the option to release to a few percent of the users, and then hang on in case something shows up.

  14. Flak
    Coat

    Canary releases? More like guinea pig

    Would not like to be a canary in this scenario...

    1. Paul Anderson

      Re: Canary releases? More like guinea pig

      In my experience, most phased / canary updates choose their guinea pigs / early adopters arbitrarily. If your number comes up for the first phase of the update release and there is a bug, you're out of luck. You're an unwitting, paying beta tester and your live systems are on the line. Few customers consent to this or are even aware of it. I consider it to be unethical behaviour and I hate it. Worse still, when your systems go down they offer to 'help you' with the problem by getting you to run diagnostics and collect masses of log files. All of which they feed back into their beta testing process.

  15. Andrew Mayo

    Windows does of course have ETW

    Event Tracing for Windows does provide conceptually similar functionality I think to what Linux and MacOS have. Unfortunately the API for ETW is rather horrible, and this blog post wonderfully explains why.

    https://caseymuratori.com/blog_0025

    (Ironically, I used Casey's blog post to actually understand how to use the API). But, as mentioned by others, this is not an interception mechanism. If you want to actually stop a blacklisted executable being run, then you need hooks at kernel level to prevent the process actually being created.

    For this kind of thing, Microsoft provide the concept of 'filter drivers'. More specifically, minifilter drivers provide a fairly robust mechanism for intercepting all kinds of system operations, including process creation.

    https://stackoverflow.com/questions/58420338/intercept-process-access-using-a-windows-minifilter-driver

    So in fairness to MS they DO provide a decent set of tooling to hook AV/EDR software into the OS, but developers need to follow guidelines and build stuff carefully of course.

    PS: Interestingly, AV/EDR software can cause performance issues due to a phenomenon I call 'resource amplification'. What happens is that with software intercepting low-level functions like registry accesses or process creation, the AV software obviously has to do stuff on each call. This takes CPU resources and in some cases other resources like disk I/O.

    Now if the system gets busy, and moves close to the point where it's running out of resource, the AV software can 'amplify' this by becoming busy itself with the flurry of extra calls. This pushes a machine over the edge before it normally would reach saturation, causing significant performance issues. Data exfiltration prevention software is a particular problem because it scans the content of files to ensure they don't contain sensitive information, an intrinsically expensive process.

    1. fg_swe Silver badge

      Insane Architecture

      We only need this cr4ap because Windows does not have a modern sandboxing concept.

      A single exploit in Outlook or a Word macro can hose the entire user's fileset plus all of the user's ODBC connections.

      Little wonder we see encryption attacks left and right.

  16. Luiz Abdala
    Joke

    I loved the expression "Canary Release" (in a coal mine) from Google.

    It is better than the Crowdstrike version, "strike a match in a room full of propane to check for gas leaks" release.

  17. JRStern

    Race condition????????????

    I don't grok this race condition, 15-boot idea.

    Are these files recreated or modified dynamically at each boot?

    But then how does deleting them help?

    SMH

    >Maybe next time some staged rollouts? A bit of QA too?

    Doh!

  18. Henry Wertz 1 Gold badge

    kernel errors

    Linux still CAN have a full kernel panic. But I've had faulty drivers log nasty messages about dereferences, null pointers, etc., and the system does carry on; the kind of errors that used to cause a full panic can now just have it basically panic that individual driver and carry on. You know, depending on how important the driver was and if it scribbled over memory or screwed the system up further.

    This MIGHT be why CrowdStrike on Linux caused *some* panics and not a panic every time; it may have been logging nasty error messages and keeping the system up the rest of the time.

  19. thatstephen

    I think it is obviously a long debate as to whether kernel access is necessary and how Windows can or can't control faulty code working at the kernel level. I think the far more important failure from Crowdstrike here is in the release 'process'.

    'Cybersecurity provider', 'faulty patch release', '8 million bricked customers systems' should not be in the same sentence unless 'bankrupt' is also in the sentence.

    The issues I have with their process are:

    1) Even without an intentional canary release process there must have been some delay in the release process, given that it was at Internet scale. The fact that a very large number of systems were hosed before anybody noticed must mean that the release was an automated process that was unattended, or at least very poorly attended. Customers pay a lot of money for services like Crowdstrike. The fact that they can do a release that does such damage, in a way that leaves them no big red stop button to press, is not an error but fraud in my mind.

    2) They are paid to provide protection from malware bringing down systems. Does none of that 'protection' extend to remediation and bringing systems back up, especially if the thing that brought your customers' systems down is very familiar to your engineers because they made it?

    3) Creating a bricked system from a release. How on earth does their patch system provide any protection from supply chain attacks or man-in-the-middle attacks?!

    It is also shocking to me that in 2024, when we have virtualization on everything, journaling filesystems, endless disc space to store rollbacks, boot protection at processor level etc., this can happen to 8 million PCs and servers and, more importantly, take a lot of time to remedy.
