CrowdStrike blames a test software bug for that giant global mess it made

CrowdStrike has blamed a bug in its own test software for the mass-crash-event it caused last week. A Wednesday update to its remediation guide added a preliminary post incident review (PIR) that offers the antivirus maker's view of how it brought down 8.5 million Windows boxes. The explanation opens by detailing that …

  1. TReko Silver badge

    It worked on my machine!

    A sanity test of actually installing it on a few real machines before deploying it worldwide to 8.5M machines is something that used to be standard QA practice.

    Crowdstrike's practices sound like criminal negligence.

    1. Joe Gurman

      Re: It worked on my machine!

      And someone at the customer sites installing the update on one (1) testbed system each before deploying to every mission-critical production system also used to be standard QA practice. Still is some places™.

      1. 142

        Re: It worked on my machine!

        Many customers *had* configured such a staggered roll-out for their CrowdStrike updates, but CS actively overruled that setting for this release...

        1. Joey Potato

          Re: It worked on my machine!

          The difficulty with staggered releases is that there is urgency to push out updates for security software. Who wants to explain to their customers that they had an update that would have prevented the breach they suffered, but had not given it to them beforehand for ... reasons. I surely don't want to be on the list for "Not-so Rapid Response Content" for my security software!

      2. Crypto Monad Silver badge

        Re: It worked on my machine!

        Maybe Crowdstrike should release their code to their *own* desktops and servers, an hour or so before releasing it to the rest of the world.

        1. Andrew_C

          Re: It worked on my machine!

          You expect them to dogfood their software?

          1. Wellyboot Silver badge

            Re: It worked on my machine!

            For software thats running at kernel level - YES.

        2. TheMeerkat Silver badge

          Re: It worked on my machine!

          > Maybe Crowdstrike should release their code to their *own* desktops and servers, an hour or so before releasing it to the rest of the world.

          Why do people comment without any understanding at all?

          CrowdStrike did not release any code. It was configuration, not code.

          1. anonymous boring coward Silver badge

            Re: It worked on my machine!

            Well, in the general sense "code" can mean almost anything.

            That's why I used to be a programmer, not a "coder".

          2. doublelayer Silver badge

            Re: It worked on my machine!

            It really doesn't matter. They should test their releases, no matter what those releases contain, before going public with them. Whether that is code, or configuration, or some other category of data doesn't really matter. If the behavior of the system has changed slightly, testing is required.

            Yes, depending on our definitions, we can argue that it's not code. After all, if someone's counting lines of code from me, they usually wouldn't count the lines of the json file I've just written. But when my program will do something different when the json file is different, it can have the same damaging effects as if I changed what we typically call code, and it therefore needs the same kind of testing.

    2. Anonymous Coward
      Anonymous Coward

      Re: It worked on my machine!

      Criminal negligence is a good term for what went down. I genuinely hope the justice system sees it that way as well. Either way, tons of businesses lost a ton of money, and if CrowdStrike doesn't cough it up, then, unfortunately and inevitably, the customers will. That is not acceptable.

      1. Anonymous Coward
        Anonymous Coward

        Re: It worked on my machine!

        This also demonstrates that contracts don't really help that much in terms of damage. Even if you can get money back from CrowdStrike, it is unlikely to cover the actual damage caused, and in some cases perhaps your business is already dead anyway.

        Let's imagine you're using a cloud service for your core IT and they "do a CrowdStrike", going down for several days. For many companies that's game over, and compensation from a contract which may or may not be honoured isn't going to help.

        1. gv

          Re: It worked on my machine!

          The old chestnut "don't put all your eggs in one basket" should always apply to your core IT in all circumstances.

          1. Optimaximal

            Re: It worked on my machine!

            But given the increasing reliance on SaaS, Cloud & External computing, you can't always rely on others also applying that logic/practice.

          2. Pascal

            Re: It worked on my machine!

            How would you put that into practice for what happened here?

            Use multiple different security vendors staggered across your redundant server infrastructure, so that any cluster-type service maintains quorum in case one of them throws a Crowdstrike?

            (That means at least 3 different vendors, licensing, training, managing).

            Make sure critical staff has access to at least 2 different workstations not sharing a single unique component of the critical software stack?

            Protecting yourself against every imaginable incident means eventually you're just juggling hundreds of tiny little different baskets!

            1. Yet Another Anonymous coward Silver badge

              Re: It worked on my machine!

              >Use multiple different security vendors

              We got hit by a ransomware again.

              But I thought we had state-of-the-art security?

              We do, 49% of the machines have Crowdstrike, 49% have MS-defender and one machine runs 'Super Number One Pyongyang security - Great Leader Edition"

              So which one got ransom-wared ?

              1. Anonymous Coward
                Anonymous Coward

                Re: It worked on my machine!

                Well the ones running CrowdStrike are bricked and we had to pay up front :-)

              2. Lord M4x

                Re: It worked on my machine!

                There's an extra machine lurking somewhere..

                49%+49%+1% = 99%

                1. anonymous boring coward Silver badge

                  Re: It worked on my machine!

                  That's the one with Windows 2000 that was just fine.

            2. mistersaxon

              Re: It worked on my machine!

              Or, as we like to call it: "increasing your attack surface"?

          3. JoeCool Silver badge

            Re: It worked on my machine!

            That is exactly why so many of the victims chose N-1 and N-2 deployment policies.

        2. Anonymous Coward
          Anonymous Coward

          Re: It worked on my machine!

          Any damage would need to be pursued independently of any contract anyway. As a software vendor, they'll have clauses in there covering faulty software. The most you could hope to achieve from the contract would be service credits against the outage.

          I can't comment directly on CrowdStrike as I haven't seen that contract (or my own company's contract with them).

          However, any outage should have been managed as a major incident internally and various DR and business continuity kicked off. This would be covered in any risk register.

          1. Crypto Monad Silver badge

            Re: It worked on my machine!

            > However, any outage should have been managed as a major incident internally and various DR and business continuity kicked off. This would be covered in any risk register.

            But how do you stop your DR environment from getting affected? Do you run no security software, or run security software from a different vendor?

      2. bsdnazz

        Re: It worked on my machine!

        The Crowdstrike software license limits their liability to the software fees paid.

        Computers are general purpose devices and can be put to many different uses. No software vendor is going to refund consequential losses while charging a standard software fee unless they can control very specifically what you do with their software and thus the risk they're exposed to.

        1. Anonymous Coward
          Anonymous Coward

          Re: It worked on my machine!

          The point is that crowdstrike apparently didn't allow customers control of when they applied the update, and thus IMHO they made themselves responsible for the consequence of installing it.

          1. Optimaximal

            Re: It worked on my machine!

            But you *know* that CS will have mitigated any liability in their SLAs.

            1. Anonymous Coward
              Anonymous Coward

              Re: It worked on my machine!

              Then some other options will be activated by Sicilian and Japanese customers.

              And will end in a bloodbath for CS management...

            2. anonymous boring coward Silver badge

              Re: It worked on my machine!

              They can try to mitigate liability in a contract, but in many jurisdictions this isn't actually valid.

              1. Anonymous Coward
                Anonymous Coward

                Re: It worked on my machine!

                I had a family member who was a lawyer tell me that (in the US at least) "you can't sign away your right to sue for negligence". I was asking about the waivers like those added to contracts for risky activities. He explained that you can ALWAYS sue for negligence. I would think in this case with Crowd Strike, it could certainly be argued that this was negligence.

                Also, remember that a number of the customers that were hit hard will have well funded legal departments. I really don't see this ending well for Crowd Strike.

      3. Gordon 10 Silver badge

        Re: It worked on my machine!

        Not a CS fanboi, but you have no way of validating that statement without a) having an internal view of the problem and b) more importantly, being a corporate technology lawyer rather than some random El Reg Anon.

      4. TheNextBozo

        Re: It worked on my machine!

        DevOps / Criminal Negligence

        Poh-tay-toe / Poh-tah-toe

        1. claimed Silver badge

          Re: It worked on my machine!

          I fail to see how “doing x badly” is the same as “doing x is bad”.

          Bad drivers crash cars… driving is bad?

      5. TheBruce

        Re: It worked on my machine!

        Not a lawyer but where is the mens rea?

        1. flayman

          Re: It worked on my machine!

          Negligence is a degree of mens rea that is less serious than, for example, recklessness.

    3. Anonymous Coward
      Anonymous Coward

      Re: It worked on my machine!

      Here's hoping that the negligence part allows an end run around license agreements that are typically used to avoid coughing up for the kind of brutal cockups we have just seen.

      Not holding my breath, to be honest.

    4. Dave K

      Re: It worked on my machine!

      I would add that deploying a ring 0/kernel level driver that takes input from a regularly updated content file and which does not perform sanity checking on that input file is also criminally negligent.

      Even given their dodgy/insufficient testing processes, this whole mess could have been avoided if the driver validated the content file before attempting to execute it...
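
      As a rough illustration of the kind of sanity check being suggested here, a minimal sketch in C. The header layout, field names and constants are invented for the example - CrowdStrike's real channel-file format isn't public - but the point is that a kernel component should reject a structurally broken file up front rather than interpret it blindly:

      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical content-file header -- made up for illustration only */
      struct content_header {
          uint32_t magic;        /* constant identifying the file type            */
          uint32_t version;      /* format version the driver knows how to handle */
          uint32_t entry_count;  /* number of fixed-size entries that follow      */
      };

      #define CONTENT_MAGIC  0xC0FFEE01u  /* invented value */
      #define MAX_VERSION    3u
      #define ENTRY_SIZE     64u

      /* Returns 1 if the buffer looks structurally sane, 0 otherwise. */
      static int content_file_ok(const uint8_t *buf, size_t len)
      {
          const struct content_header *h = (const struct content_header *)buf;

          if (buf == NULL || len < sizeof *h)
              return 0;   /* too small to even hold a header */
          if (h->magic != CONTENT_MAGIC)
              return 0;   /* wrong, truncated or corrupted file */
          if (h->version == 0 || h->version > MAX_VERSION)
              return 0;   /* format version this driver doesn't understand */
          if (h->entry_count > (len - sizeof *h) / ENTRY_SIZE)
              return 0;   /* declared entries would run past the end of the file */
          return 1;       /* only now hand it to the interpreter */
      }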

      1. Pascal Monett Silver badge

        You're supposing the local driver would be able to detect issues better than the testing suite that was written by the same company.

        I don't see that happening.

        1. Cris E

          Maybe, but I would say that testing suite + local driver tests are better than either alone.

    5. simonlb Silver badge
      FAIL

      Re: It worked on my machine!

      Yeah, you would expect the Validator to at least verify that the file it's validating actually contains data which conforms to the format specific to the design of that template, rejecting anything that isn't correct. That's also ignoring the requirement that the service running on the server performs input validation on any file it's ingesting, although that might not be possible due to the way Windows works. Either way, it's a massive fail.

      1. Dimmer Silver badge

        Re: It worked on my machine!

        Re: simonlb

        Wait, that sounds a lot like what an antivirus is supposed to do.

        I thought that I heard something about a requirement that a product has to do what it is marketed to do.

    6. Charlie Clark Silver badge
      Stop

      Re: It worked on my machine!

      They may sound like criminal negligence but I think we'll find out they were perfectly legal, covered by the terms of use, and Crowdstrike's software will continue to enjoy immunity from liability along with everybody else's.

      And, if you are going to blame someone, you must include Microsoft for giving third-party code this kind of privilege.

      1. Diogenes8080

        Re: It worked on my machine!

        No, that plays to the Microsoft canard of blaming an open market for security software for the catastrophe. That is in turn a not-so-subtle return to the position where MS grant themselves monopoly privileges when writing new software (because only they have complete access to the system APIs). We were arguing that one in the mid-90s.

        I would point out that no Crowdstrike customer was deceived into installing the software, and I would expect that all of them fully accepted that they were granting significant system trust to Falcon. What no-one expected was the shocking lack of software quality that allowed a poorly written _data_ update to crash the software and the machines it ran on. Blaming an automatic content validation tool is no excuse; that approach would not prevent an attack by poisoning the data files after validation.

        Crowdstrike need to fix that flaw before I would trust Falcon on my kit.

      2. redpola

        Re: It worked on my machine!

        I wonder what proportion of comments on this site are in the “Waa! Apple locks everything down and has too many private APIs in their walled garden!” camp versus those now complaining that third parties have TOO MUCH freedom.

        1. anonymous boring coward Silver badge

          Re: It worked on my machine!

          Shhh... Don't mention the fact that people are angry at Apple for protecting people from themselves. No matter how much this is actually needed.

          I agree that Apple can be a pain in the butt, but people are amazingly good at screwing up their security... or reliability.

    7. Rich 2 Silver badge

      Re: It worked on my machine!

      Indeed. Automated testing is all very well and good but there is (demonstrably) no substitute for actually testing it for real.

      A famous quote from Knuth (might not be EXACTLY right):

      “I have proven that it is correct. But I’ve not tested it”

      1. anonymous boring coward Silver badge

        Re: It worked on my machine!

        There's no substitute for good design, no matter how much testing you do. And this isn't good design.

    8. uccsoundman

      Re: It worked on my machine!

      That's Agile for you! Test it in production. NOTHING is more important than getting the release out on time, and "on time" means yesterday.

      1. Dagg Silver badge
        Coat

        Re: It worked on my machine!

        That's Agile for you!

        Perfect! Just love that agile reference.

      2. Charlie Clark Silver badge

        Re: It worked on my machine!

        I suspect you're referring to the "release early, release often" and "move fast and break things" approaches that have become popular in the last few decades. While often associated with agile development, these have nothing directly to do with it. "Bleeding edge" releases for things like BSD have been there for years with the idea being that they do get tested in the real world, but anyone who puts them in production is ignoring all the warnings and deserves to lose their job. "Move fast and break things" is the mantra that Silicon Valley adopted for building on the MVP (minimal viable prototype) approach which puts gaining market share as quickly as possible above all other priorities, including software quality. "Agile" was supposed to fix problems magically if they occurred, to meet the few regulatory requirements and avoid any liability. This worked well enough that the few spectacular failures were deemed acceptable and has gone on to pervert many industries where the maxim "failure is not an option" should always be observed.

        If this were agile, it would be easy to rollback to the previous version. QA certainly fucked up but this is just as likely to happen with other methodologies as we see regularly with other software providers.

        1. doublelayer Silver badge

          Re: It worked on my machine!

          "If this were agile, it would be easy to rollback to the previous version."

          And from their side, it was. It didn't take them very long to release the version that patched this. The only trouble was that the buggy version damaged things so badly that the users couldn't revert as quickly as the release process could.

          1. Charlie Clark Silver badge

            Re: It worked on my machine!

            Two things: testing was inadequate; and once again Microsoft's original sin in letting this kind of driver take the system down so effectively. It's not as if the system doesn't have the ability to create save points for rollbacks to preserve integrity.

            And it really is Microsoft's failure and the IT monoculture that are at fault here. If Microsoft does not provide better system integrity then repeats and exploits are only a matter of time. Customers should demand better but can only do so if they can make credible threats to move their custom elsewhere.

            1. Bill 21

              Re: It worked on my machine!

              Er. Crowdstrike has actually managed to take down Linux PCs the same way in at least a couple of smaller scale incidents earlier this year.

    9. UnknownUnknown

      Re: It worked on my machine!

      Agile, Innit!

  2. LosD

    They didn't test the tester! Or the tester tester! Or the tester tester tester!

    Typical *eyeroll*

    1. AndrueC Silver badge
      Joke

      It does appear to have been a testes up.

    2. richardcox13

      I recall, from some time ago at my first job (so early '90s), talking to my manager about test rigs. He talked about how much effort, for a safety-critical system, was put into testing the test rig. The test rig itself was relatively simple, but validating that it covered all necessary good and bad cases for the test target was substantial.

      Something you are pushing updates to at a rapid pace needs more than just a few tick boxes.

      It isn't as if CI/CD pipelines running multiple sets of tests using all sorts of software (for example on Windows Server 2022) are hard to come by (the VM images for these are freely available).

      An underlying organisational culture of "update faster, respond fast, Fast, FAST" is likely the real problem.

    3. UnknownUnknown

      They may have ‘done automated testing’ ….but did they actually test on any real world kit representative of what a customer has - did they put it on some PC’s, servers, registers, signage, kiosks, self-checkouts, embedded etc … virtual or otherwise - and do normal things like reboot them ?

      Test kit costs money obvs ….

      That seems completely unmentioned….

  3. Anonymous Coward
    Anonymous Coward

    Yeah, no excuse for not deploying on test machines or cannery channels. They are not talking themselves out of this one. They are quite literally The Man Who Sold The World.

    1. Jimmy2Cows Silver badge
      Pint

      Re: no excuse for not deploying on ... cannery channels

      You want them to test on canning factories first? The same factories that package our beer!?! I dunno... seems awfully risky. You thought CrowdStrike was bad? Just wait 'til the beer cans stop flowing...

      1. Paul Herber Silver badge

        Re: no excuse for not deploying on ... cannery channels

        Real beer is not available in cans!

        1. Wellyboot Silver badge

          Re: no excuse for not deploying on ... cannery channels

          Glass bottles are acceptable when the pub is shut

          1. Angry IT Monkey
            Pint

            Re: no excuse for not deploying on ... cannery channels

            Now that's what I call a DR* Plan!

            * Drink Responsibly / Regularly ->

  4. hoola Silver badge

    Automation....

    This revelation just sums up the insanity of where we are.

    So much is now reliant on software to test stuff with simulations or whatever shite it does that the fundamental concept of actually TESTING something in a live environment has gone.

    This is not an oversight or anything like that, it is a disaster that has been waiting to happen. Now that it has happened, however, nothing will change because it is a cultural issue. Too many just will not believe that the old-fashioned way of actually installing something to see if it works is ultimately better.

    This is because it is considered 'legacy'......

    Utter morons.

    1. Anonymous Coward
      Anonymous Coward

      Re: Automation....

      I'd say you can blame Microsoft twice for this.

      First for creating an OS, and selling it via questionable means, that despite actual terabytes of updates still by default resembles a colander from a security perspective and thus requires all sorts of shoring up with IT plasters and bandages to keep it together; and next for sacking their testers and making it acceptable to push shockingly shoddy, not-even-beta-quality code out without any apology or shame, and so making it acceptable for other organisations to do the same.

      The problem is that it has never had any real consequences for Microsoft (except with us, but we're such a tiny exception it doesn't even register) - they still get paid. As long as that does not change I do not expect any improvement any time soon.

      This WILL happen again.

      1. ChrisElvidge Silver badge

        Re: Automation....

        Can you blame them a third time?

        For advising keeping your Bitlocker keys in their cloud.

  5. Anonymous Coward
    Anonymous Coward

    I don't understand this sentence

    it reads evasively

    "CrowdStrike assumed that tests that passed the IPC Template Type delivered in March, and subsequent related IPC Template Instances, meant the July 19 release would be OK."

    What didn't they test, and why ?

    1. Strahd Ivarius Silver badge
      Coat

      Re: I don't understand this sentence

      Translation in normal language: "We tested the system once several years ago, and everything was ok, no need to test anything now"

  6. Mishak Silver badge

    Not sure what language they use

    But it isn't normally difficult to capture and handle unexpected exceptions:

    try {
        instantiateTemplate();
    }
    catch (...) {
        // last-resort handler: disable the bad content and log it, rather than crash
        handleBadThingsThatShouldNotHappen();
    }

    Making sure that the exception handler can't throw an exception, of course.

    PS - anyone know how to stop the html code blocks on here from adding space around newlines and removing leading spaces?

    1. MrBanana

      Re: Not sure what language they use

      You're thinking about user code written in a high level language, where there is a safety net in the kernel to catch your screwup and gracefully return an exception. If you're running as a kernel driver you will be writing in low level C and possibly assembler: no safety net, no exception handlers, just a hard crash. Made even worse by insisting on running at first boot - total borkage.

      1. G40

        Re: Not sure what language they use

        You should have a look at the code for a modern Windows kernel driver. Might change some of your preconceptions.

      2. Blazde Silver badge

        Re: Not sure what language they use

        That's not really true. Structured Exception Handling works absolutely fine in the Windows kernel. In fact if you write a driver you may well be more exposed to exceptions and end up with a greater appreciation of how important they are for control-flow at the CPU level. They for sure shouldn't be used to catch screw-ups though - that goes for all exceptions doesn't it?

        That said, native C++ exceptions don't work. Python exceptions don't work (duh)... etc
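
        For anyone who hasn't met SEH, here is a minimal user-mode sketch in C (MSVC syntax) of the mechanism being described - the deliberately bad pointer is just for demonstration; kernel drivers use the same __try/__except keywords, though not every fault is catchable there:

        #include <windows.h>
        #include <stdio.h>

        int main(void)
        {
            volatile int *wild = (volatile int *)0x10;  /* deliberately invalid pointer */

            __try {
                *wild = 42;                             /* raises an access violation */
            }
            __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                          ? EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH) {
                printf("caught access violation, carrying on\n");
            }

            printf("still running\n");
            return 0;
        }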

    2. Jon 37 Silver badge

      Re: Not sure what language they use

      The Windows kernel is written in C. C itself doesn't have exceptions at all. Although Windows does have an exceptions mechanism that's kludged in there.

      But, that's the wrong solution.

      In C, you can write to a wild (invalid) pointer, and that might be caught by the OS or might just write to a random bit of RAM. In the kernel, you can corrupt any RAM that way, causing some other part of the system to go wrong (perhaps much later) in an unpredictable way.
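
      A tiny user-space illustration of that class of bug (whether the neighbouring variable actually gets hit depends on the compiler and stack layout, but the write is undefined behaviour either way - in a kernel the same mistake can silently corrupt unrelated data):

      #include <stdio.h>

      int main(void)
      {
          int canary = 1234;   /* neighbouring data we never touch directly */
          int buf[4];

          /* Off-by-one: writes one element past the end of buf. Nothing faults,
             nothing is caught -- the damage just sits there until something else
             reads the corrupted value. */
          for (int i = 0; i <= 4; i++)
              buf[i] = 0;

          printf("canary = %d\n", canary);   /* may now print 0 instead of 1234 */
          return 0;
      }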

      So, you absolutely have to write your code correctly so it doesn't try to write to an invalid pointer. This is not optional. If you're doing that, then you can use the same techniques to make sure you don't read from an invalid pointer.

      And once you've done that, you don't need to try to catch exceptions from using invalid pointers. And you shouldn't even try, because there is nothing sensible you can do if you catch one.

      If you're a C# or Java programmer, then you might not have come across the concept of invalid pointers. One of the big improvements in those languages, is that they ensure that pointers are valid. They don't have raw pointers, instead they wrap them in object references and arrays. That makes this entire class of bugs impossible.

      Rust also makes this class of bugs impossible, which is why the Linux kernel is introducing Rust for some parts. (Both Java and C# use a "garbage collector", which does not fit in an existing kernel easily. Rust doesn't, which makes it a better fit for gradually converting parts of an existing kernel to a safer language).

      1. Jellied Eel Silver badge

        Re: Not sure what language they use

        Rust also makes this class of bugs impossible, which is why the Linux kernel is introducing Rust for some parts. (Both Java and C# use a "garbage collector", which does not fit in an existing kernel easily. Rust doesn't, which makes it a better fit for gradually converting parts of an existing kernel to a safer language).

        Alternatively... leave the kernel well alone? This kinda reminds me of a debate from late last century on my degree, in which Z was inflicted on us to learn formal methods and software assurance. But proof in Z meant not much when we then had to hack away in C or assembler and hope the compiler didn't have any bugs on top of the ones we were writing. Which seems to be the problem, ie the kernel is the core of the OS, so if you want all the cruft that's wrapped around it to have a chance of working, it should be left to its own devices. Which I guess has been the problem, ie the pressure to allow hooks into the kernel.

      2. Mishak Silver badge

        According to Microsoft...

        Windows drivers are able to handle a number of exceptions, and I have used them when writing kernel mode drivers for high speed (at the time) comms interfaces.

        Sure, writing through invalid pointers can happen, but page protection should be in place to stop that affecting resources that are not allocated to the current thread of execution. It is, of course, much better (and harder) to ensure that invalid pointers cannot be formed.

        1. Jon 37 Silver badge

          Re: According to Microsoft...

          This is a driver that will be intercepting calls to the Windows kernel from any thread, so thread level protections don't save you.

          The plan for recovery from a bad pointer dereference in the kernel is:

          1) Hope it gets detected by accessing an invalid page.

          2) Assume that something is already corrupt. Crash the computer (BSOD) to prevent things getting worse and destroying more data.

          3) Reboot. Hope that fixes it for long enough for a human (or script or automatic update) to replace the faulty driver.

          4) If stuck in a reboot loop, automatically disable all the non-essential drivers. Hope that the system boots that way. Then a human (or script or automatic update) can replace the faulty driver.

          5) If still stuck in a reboot loop, it's going to need human intervention at the console.

          In this case, step 4 would have saved the day, but it failed because CrowdStrike decided their driver was "essential".

      3. Martin Gregorie

        Re: Not sure what language they use

        I noticed recently (after an update to Fedora 40) that the current C compiler versions are now silently adding a zero byte to the end of string variables - something that Java has done for years, so this makes a nice improvement. A quick scan of the C compiler's man page shows that the Gnu C compiler now accepts options to control this feature (-Wno-stringop-overflow -Wno-stringop-overread and -Wno-stringop-truncation), so this behaviour is now the default.

        However, I don't recall seeing any announcements about this, so did I miss them, or was this feature just quietly slipped in?

        1. Anonymous Coward
          Anonymous Coward

          Re: Not sure what language they use

          > I noticed recently (after an update to Fedora 40) that the current C compiler versions are now silently adding a zero byte to the end of string variables

          I don't understand what you're asking. C doesn't have a "string variable" type. If you mean string literals then the C compiler has added a '\0' automatically since forever.

          1. Mishak Silver badge

            Array initialisation can cause surprises

            char a1[ ] = "abc"; // Terminated

            char a2[3] = "abc"; // Not terminated!

            A C compiler may warn that the initialiser for 'a2' does not fit. C++ is a bit smarter and will issue an error.

      4. F. Frederick Skitty Silver badge

        Re: Not sure what language they use

        Slight nitpick, the Windows kernel is written in a mix of C and C++. However, it's mostly C. Drivers are frequently written in C++, but certain features - including normal exception handling - are unavailable.

        As for Java programmers not knowing what pointers are, the dreaded NullPointerException means we know of them. Modern Java encourages the use of Optional for return values that may be "empty", making it clearer and handling of them more explicit.

      5. gnasher729 Silver badge

        Re: Not sure what language they use

        “Rust makes that class of bug impossible” - no, it makes it impossible for it to go undetected. So you tried to access memory that you should never, ever access. Rust can’t fix that. It can make sure the error is detected. What then? You don’t have code to handle it, because it’s not supposed to happen.

        It’s the kind of thing where a safe language will crash your app to make sure nothing worse happens. For many apps, crashing is quite harmless. In this situation, it’s the worst you could do (but there is nothing else either). So Rust wouldn’t help.

        1. Blazde Silver badge

          Re: Not sure what language they use

          By 'detected' I think you're alluding to bounds checking, which is somewhat different to an unsafe language throwing an Access Violation/segfault only when it hits memory that isn't even there. In the latter case you have no idea if the buggy kernel driver stomped all over important OS data before causing the exception. With Rust bounds checking there may be some logic error further upstream but your memory structures remain uncorrupted. Having a 'wild index' into your container is very much less dangerous than having a 'wild pointer' into raw memory, which can't happen in safe Rust code.

          So it is feasible to recover by restarting/reinitialising the driver and let the kernel carry on unharmed, though that depends on expecting the unexpected and coding that behaviour. I think the accepted wisdom is to try to produce driver code which can never panic - that is all potential bounds checks and other panic situations have explicit handling (eg. for a typical container access you simply use .get() and deal with a potential None return if your index is out-of-bounds, rather than [] indexing which panics). That's quite daunting to achieve for anything complex.

      6. JoeCool Silver badge

        Re: Not sure what language they use

        "Rust also makes this class of bugs impossible, which is why the Linux kernel is introducing Rust for some parts"

        I don't think so. The quote I recall from Linus is a little more sanguine, something along the lines of 'it probably won't hurt to have rust code possible in the kernel'.

        I do not recall any "roadmap" or project justifications for adding Rust support.

    3. Anonymous Coward
      Anonymous Coward

      Re: Not sure what language they use

      Based on what I see on a regular basis, I expect that the catch will only log that there is an error, and not do anything useful...

      And probably the logging part will crash the computer.

  7. Anonymous Coward
    Anonymous Coward

    Who tests the tester?

    This is a bit silly really. You could have tests for the tester, then tests for the tester tester, then tests for the tester tester tester. This could go on forever. What happened to having a team to test rather than relying on what we now know to be faulty test automation? How can you even have test automation when, as in this instance, the fault was unknown to the test automation so never got flagged as a fault? I think it boils down to the age-old issue of paying money and ways to avoid it. Why have actual testers when we can do it without them, or with fewer, for less? I can even imagine the meeting they had at some point where they talked about test automation and the money they could save by laying off staff. Pats on the back all round, chaps.

    1. Jellied Eel Silver badge

      Who tests the tester?

      Richard Feynman wrote some good stuff on testing. Take 2 teams, one to make it, one to break it. It's one of those things where subconscious biases can affect things. I design something to the best of my ability, work through a bunch of failure scenarios and pass it off to another team, who promptly think of something I didn't think of and break it.

      This is a bit silly really. You could have tests for the tester then tests for the tester tester then tests for the tester tester tester.

      Yep. But break it down with your trusty Occam's Razor and you get a faulty o-ring. Or in this case, assuming a simulated test was a real test. It sounds like there wasn't actually any pre-deployment test of letting the update loose on a bunch of test environments and seeing what happened. Around 8.5M systems found that out the hard way.

      1. Anonymous Coward
        Anonymous Coward

        It's funny, but when I was working as a data analyst I used to ask my boss to sanity check my work, and he used to say to me all the time, "why don't you do it?" To which I would reply: I've already done that and checked it 3 times, with as many tests and reconciliations of the data as I can think of, but I wrote it so I think it's right anyway. He got the idea in the end, and on the very rare occasion there was a cockup it was caught.

        As with this and pre-deployment tests, I'm going to assume they run the software on their own systems, so you would think they would at least let it loose on those first, after all the testing is done, before firing it out to the world. Like a final sanity check. But I guess not.

        1. Will Godfrey Silver badge
          Facepalm

          Of course not! If they did that and it crashed it would be a nightmare to fix... OH!

      2. Tony W

        Seems very like a major problem that Feynman identified: because it worked a few times, it must be OK. It's a fallacy that humans are very prone to, in all sorts of situations.

  8. Admiral Grace Hopper

    Test, test, then test again.

    I'm explaining the value of testing to a bunch of wanna-be junior techies this week and the CrowdStrike shenanigans have been an excellent example.

  9. sitta_europea Silver badge

    Has anybody else noticed that "Safe Mode" presumably means the other mode - the one you normally use on Windows - must be "Unsafe Mode"?

    Do you think the techies in 1995 ran the name past marketing first?

    1. Admiral Grace Hopper

      The same techies that put the Shut Down command on the Start menu?

      1. david 12 Silver badge

        The same techies that use the ignition switch for turning OFF the car?

        1. F. Frederick Skitty Silver badge

          It's a switch, so has two or more states. One of those states turns the ignition system off, so I think you've got a classic "bad car analogy" there.

        2. An_Old_Dog Silver badge

          Ignition Switch Versions

          v1.0 - a two-position key-operated switch. The positions are "OFF" and "ON". A childhood friend's father had a then-old Chevrolet truck with this type. You stepped down on, and held down, a separate floor-mounted normally-open switch to run the starter motor.

          v2.0 - a four-position key-operated switch. The positions are named (left-to-right for dashboard-mounted sub-types, and stern-to-bow for steering-column-mounted sub-types) "OFF", "ACC", "RUN", and "START". "ACC" is for "accessories". Turn the key to START and hold it there for as long as you want the starter motor to crank. Releasing the key spring-returns the key to the RUN position.

          v3.0 - a normally-open, dashboard-mounted, or touchscreen-implemented push switch. Press it and hold it down to crank the starter motor. This switch has no effect unless the car's RF key fob (or a reasonable clone thereof) is "in range".

      2. Derezed
        Headmaster

        Shut down on the Start menu is legitimate UI.

        Start is where you start the task you want to accomplish…in this case shutting down.

  10. Julian Poyntz

    Next assume, never promise

    Two things I learnt waay back in the past and have been so true all these years

    1. Bebu
      Headmaster

      Re: Next assume, never promise

      Never assume

      Assume is a unary logical operator that invariably negates the following assertion when said assertion is not tested.

      I assume that makes it a second order operator. :)

  11. Julian Poyntz

    break testing

    The line "worked in March" is a bit of a bell ringer. Why think it would still work now?

    I also wonder what "break" testing they did. You see it so often with testing: things are tested to show they work as expected, so when something happens that should not (such as a button going missing) it gets missed.

    However, reading a config file of any type without internal validation is really rather worrying, though I imagine the excuse is "to keep it as small as possible", which would not be the first time to hear that excuse; but looking at the size of numerous files it is complete bollocks.

  12. Anonymous Coward
    Anonymous Coward

    Secure boot?

    The details don't make clear if these data files were signed and validated by the driver in any way - if they're not then surely this could be a secure boot violation, given (clearly by the BSOD encountered) they can cause operations to be executed with kernel privileges...

    1. Steve Graham

      Re: Secure boot?

      "The details don't make clear if these data files were signed and validated by the driver in any way"

      Clearly not. The file that caused the driver to fall over was corrupt. (I think I read somewhere that it was full of zeroes, but I might be mis-remembering that.)

      1. doublelayer Silver badge

        Re: Secure boot?

        That doesn't necessarily preclude signing. Step 1 makes a file. Somehow, the output was corrupted during or after step 1. The file is then passed to step 2, which signs it. The result is a file that is properly signed, verifies just fine, and once the signed content is extracted, it's still invalid. Steps 1 and 2 don't even have to be separate processes, though they probably are.

  13. tallen
    Facepalm

    Asking for a friend: So just how do you read the release notes before the auto-update tanks your machine?

    1. Tom Chiverton 1 Silver badge

      Reboot 15 times.

      Seriously.

      1. gnasher729 Silver badge

        There is a very good explanation for this.

        You have a data file that crashes crowdstrike while booting. There is a new data file available that doesn’t do that. Every reboot you have a tiny, tiny chance that the new file gets downloaded before the crash. 15 reboots and you apparently have a reasonable chance.

  14. ComputerSays_noAbsolutelyNo Silver badge
    Joke

    Testing as a Service

    Big customers, who stand to lose much, could send a small sample of their machine park to DownStrike, so that new updates can be tested prior to the full roll-out.

    Evil Marketing guy: That's such a nice operation, you have running. It would be such a shame if some bad update would happen to it.

  15. xyz Silver badge

    Ah "the dog ate my homework" excuse...

    As others have noted, nothing beats real machine testing. I tested some code once (not my code) on about 5 machines and it would work, or it fell over, or it would work etc. I nearly got done for sabotage... Turns out the previous testing had been done on vms and what I'd exposed was a showstopper and my execution was cancelled.

  16. glennsills@gmail.com

    Missing the point

    It is interesting to know the reason CrowdStrike missed the bug in its software, but it does not matter that much. Bugs are going to happen. The real problem illustrated by this disaster is the fact that organizations like Delta Airlines allow automatic updates to their systems without testing them first. I understand that inside places like Delta Airlines people are desperately searching for someone to blame. Placing blame will not prevent this sort of mass outage from happening in the future. Switching to another OS or another programming language will not prevent it either. What is needed are operational changes to how software is deployed. Automatic updates to systems that require continuous uptime are the root problem.

    Unfortunately, CEOs get rewarded for taking bold risks rather than being prudently cautious. The soon-to-be ex-CEO of Boeing is a good example. Under his direction, Boeing became less of an "airplane manufacturer" and more of a "profitable corporation". This was a risk but in the short term, increased profits were almost guaranteed. Planes fell out of the sky. The CEO was "punished" with a 33 million dollar payout package. As long as CEOs of major corporations continue to be thus incentivized, future computer mass outages are assured.

  17. Jamie Jones Silver badge

    How can we trust ...

    How can we trust that this software doesn't mess up things in more subtle ways?

    It runs in kernel space - what about potential bugs that may make subtle corruptions in memory, that adversely affect data in other programs but go unnoticed?

    I'm not comfortable with the thought of third party modules having this sort of access without any of the scrutiny that should be applied.

    1. Anonymous Coward
      Anonymous Coward

      Re: How can we trust ...

      According to some people working for government agencies, CS sends back a lot of data to the mothership...

  18. steven_t
    FAIL

    Four main issues

    From reading CrowdStrike's explanation, it seems to me there were four main issues:

    1) The content interpreter (running on 8.5 million Windows endpoints) can render the machine unusable when it reads invalid data in an IPC Template Instance - this indicates a QA problem in a critical software component and is quite concerning.

    2) CrowdStrike's policy is that IPC Template Instances can be rolled out to 8.5 million endpoints without any testing, as long as they pass the checks in the Content Validator. This appears to reveal a staggering degree of complacency by management.

    3) The Content Validator contained a bug which allowed invalid IPC Template Instances to pass the tests - this indicates a QA problem in this software component, which wouldn't normally be considered critical, except for the policy of not requiring any other testing.

    4) Someone created an invalid IPC Template Instance and submitted it for checking and release. This is an everyday type of mistake which should have been caught by QA processes and tools and, as a last resort, by validation within the content interpreter.

    There were other things which could have been done to reduce the impact, such as allowing customers to know what was in each channel file and decide for themselves how they wanted to deploy them but, given that this wasn't part of the business agreement, I think the four points above are the main failings which led to this catastrophe.

    I don't have any connection with CrowdStrike, either as an employee/contractor or as a customer/user/victim. I also wasn't significantly inconvenienced by it, as I wasn't planning to use a plane, train, doctor or any of the other services which were affected.

    1. Strahd Ivarius Silver badge
      Devil

      Re: Four main issues

      Fire the intern who created the invalid IPC template!

  19. Bebu
    Windows

    my surprise...

    was that only 8½ million Windows machines were affected. I would have imagined ten times that many. I suppose the wrong hundred thousand machines could wreak havoc while ten million personal devices could be an inconvenience.

    1. that one in the corner Silver badge

      Re: my surprise...

      > I suppose the wrong hundred thousand machines could wreak havoc while ten million personal devices could be an inconvenience.

      A key point here is that CrowdStrike is only installed (barring a few home users with more money than sense - or a "borrowed" work key) by companies, and generally larger ones at that.

      So personal devices (barring ...) were never at risk from this cockup - instead it was going to be machines that stopped one part of a (big) company doing something, which stopped their colleagues doing something else which...

      1. hoola Silver badge

        Re: my surprise...

        Crowdstrike has done some absolutely brilliant marketing and sales targeting C Suite plonkers and best of all Cyber Insurance outfits. It is the latter that is the secondary cause of this and is why they are so endemic with their snake oil shite. Carbon Black is a similar product in a similar situation.

        We had a huge panic over a malware/encryption/compromised client due to user stupidity. We had to install Carbon Black on all our servers as part of the response from the insurers. This incurred no exceptions anywhere, so we were now at huge risk of the backups being duffed up by the very product trying to find malware. Eventually common sense prevailed and a compromise was reached.

        Nothing was found after weeks or maybe months.

        These teams run by insurers (mostly highly paid but fundamentally non technical consultants) drive a lot of this crap and make things worse. But, and this is the killer, to get the insurance you have to use approved tools, in the event of an incident you have to use approved tools and that means any that is sold as not being "legacy"

        And that is the snake oil, anything now that is "cloud native" is seen as infallible and the traditional products as legacy even if the latter works better.

        As an aside, both of the cloud products I reference could not even detect an EICAR file or a test malicious script inserted into a web page.

        Both were reliably detected by our existing "legacy" product we were under pressure to replace. How do I know the new product actually works? The answer was that you did not, unless it detected something.......

    2. Anonymous Coward
      Anonymous Coward

      Re: my surprise...

      For a European company I work with, only the people who were traveling in Asia at that time were infected.

      And of course also the poor IT manager who got a wake-up call at 06:30, started her computer to join the meeting and saw it crash a few minutes later...

      People who started their computer after 08:30 didn't get infected.

      All servers running 24/7 were down however.

  20. first-last
    Mushroom

    Parse, don't validate

    It seems the Crowdstrike programmers haven't heard of this gem: (or the boss didn't allow it).

    Parse, don't validate[1]. Summary: Parsing is the act of going from less typed data into more typed data (or an error). After (proper) parsing you know that the data is valid!

    It ties in with LangSec[2]. Summary: LangSec states that every input to a program is actually a (little) language. That's best handled with a parser. See point 1 for why.

    Failing to do proper parsing demonstrably leads to --->

    1: https://news.ycombinator.com/item?id=35053118

    2: https://langsec.org

    1. that one in the corner Silver badge

      Re: Parse, don't validate

      > Parse, don't validate

      No.

      Parse *then* validate.

      > After (proper) parsing you know that the data is valid!

      Successful parsing says the data is grammatically well-formed, *not* that it is valid.

      my_age = 264

      may well parse, but it ain't valid. Heck, depending upon the grammar,

      my_age = purple

      may parse quite happily (I am, of course, a super-intelligent shade of the colour blue, not purple in the least).

      You can add semantic checks into the erstwhile "parser" code, but unless you have put those into the grammar (e.g. there is only a fixed set of colours my age could be) then you are just mashing up the terms used to describe what your code is doing.

      In particular, validation of data can (often does) involve cross-checking with other data, which need not have gone anywhere near your parser.
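
      A small C sketch of that distinction, using an invented parse_age() helper: strtol handles the grammar ("is this an integer at all?"), and the range check is the separate semantic validation that parsing alone cannot give you:

      #include <errno.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* 0 = parsed and valid, -1 = not well-formed, -2 = parses but semantically invalid */
      static int parse_age(const char *text, int *age_out)
      {
          char *end = NULL;
          errno = 0;
          long value = strtol(text, &end, 10);

          if (end == text || *end != '\0' || errno == ERANGE)
              return -1;                     /* grammar failure: not an integer */
          if (value < 0 || value > 150)
              return -2;                     /* well-formed, but no human is 264 */

          *age_out = (int)value;
          return 0;
      }

      int main(void)
      {
          int age;
          printf("\"42\"     -> %d\n", parse_age("42", &age));      /* 0  */
          printf("\"264\"    -> %d\n", parse_age("264", &age));     /* -2 */
          printf("\"purple\" -> %d\n", parse_age("purple", &age));  /* -1 */
          return 0;
      }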

  21. Valeyard

    automation testing is fine and good for most cases, but afterwards the release still goes to some testbed/pre-prod first for a manual eyes-on sanity check before deploy

  22. gnasher729 Silver badge

    Bug in a validator not validating things is one thing.

    But surely some developer who created that file must have tried it out and must have checked that it does what it is supposed to do?

    I read it was supposed to prevent some use of named pipes by malware. So I would have expected that some developer set up some malware, checked that it successfully used these named pipes, implemented the change, and verified that the malware now failed to use these named pipes. And while testing this they would have noticed a crash during boot.

    So apparently they tried to fix a problem, and the developer couldn’t even be bothered to try out that the problem was fixed?

    1. that one in the corner Silver badge

      > So I would have expected that some developer set up some malware, checked that it successfully used these named pipes, implemented the change, and verified that the malware now failed to use these named pipes. And while testing this they would have noticed a crash during boot.

      Go back over your scenario and compare it to Joe Bloggs's PC on Friday morning, as it BSODs.

      What is different?

      Joe's machine does *not* have any pipe-using malware.

      So, how about the dev *didn't* notice a crash during boot, because it didn't crash, but spotted the malware and dealt with it. Maybe even called over the PHB to demonstrate the positive case. Job done, sign off, release update.

      Whoops, tested the true positive, demonstrated the clever stuff worked. But forgot to test the negative condition. The one that most Users actually have.

      If that happened, it would still be a QA failure, of course, but, be honest, who hasn't forgotten to test the negative condition, at least once.

      And "once" is the number of times that poor sod of a (hypothetical) Dev would forget.

      1. gnasher729 Silver badge

        No, what has been reported was that just _reading_ the new file crashed. It never got far enough to take any action based on the content. Old file: check for condition 1, 2, 3 and 4. New file: check for condition 1, 2, 3, 4 and five. “Five” instead of “5” crashes. Whether any malware tries to do any of these things doesn’t matter.

    2. OhForF' Silver badge

      Well, it is possible the developer checked and it worked fine in the development environment, while the live environment used different validators and "only" one of the validators on live choked on the new file. Still a big fail not to at least test it once with the same environment as live (e.g. by rolling out to a single machine) before rolling it out to millions of machines.

    3. Falmari Silver badge

      @gnasher729 "So apparently they tried to fix a problem, and the developer couldn’t even be bothered to try out that the problem was fixed?"

      I'm not saying you are wrong, but while the release that was deployed to the channel was certainly not tested to see if the problem was fixed, that does not mean the developer did not test that the problem was fixed.

      What the developer tests is not what's deployed for release. The developer tests and then checks their work in. What comes off the pipeline that deploys for release has to be tested, because there is no guarantee that something has not gone wrong between the developer and what's to be released. We always test what's going to be released, don't we?

    4. doublelayer Silver badge

      I'm guessing there was a difference between the version that was tested and the version that got released. That could happen in a lot of ways. Maybe two changes were merged into this file and building them together makes the bad file. Maybe it had to do with some additional content in the production build which isn't present in the debug build. There are plenty more.

      I've seen the latter example from time to time. For instance, a task where two people wrote code. First, my coworker wrote one unit, then sent it to me. I wrote the second unit. In my testing, these units worked just fine together. Correct results, no crashes, positive and negative results handled as expected. Build it for production and the automated tests freak out instantly. The reason: my debug build was writing more to the log file in case anything went wrong. That slowed things down slightly, which was enough to prevent the race condition in the two processes from going wrong. Take out the logging and the processes might have a concurrency problem and fail. But it worked just fine on my machine. Probably it would have failed eventually if I ran it with the extra logging enough times, but it didn't in the maybe thirty runs I actually did.

  23. Anonymous Coward
    Anonymous Coward

    "engineers to leverage in Rapid Response Content."

    And I claim another full row in bullshit bingo. "Leverage" is not a verb, no matter from which school of marketing twaddle you graduated.

    1. vcragain

      No - it's a great example of new usage of an existing word - how language changes over time ! We all know exactly what the user means ! Language purists screw themselves up on things like this all the time. Language is an ever-evolving thing, subject to whatever new variations of it get introduced as seems fit. We now barely understand what English speakers of ancient times meant by what they wrote !

      1. Anonymous Coward
        Anonymous Coward

        Like the "new meaning" of the word "knobhead"

      2. Cav Bronze badge

        "how language changes over time"

        Language should evolve to cover new uses, not stupidity and laziness. There is no need to use "leverage" as a verb (which it isn't) when the verb "to lever" already exists.

  24. JRStern Bronze badge

    Wise man say measure twice cut once

    Yeah well, when you find a bug like this you fix it twice.

    First, put in an error handler so if that kind of thing happens again it is handled soundly without BSOD.

    Second, fix whatever the problem was here.
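
    Something like this sketch for the first half (hypothetical structures and names, not CrowdStrike's API): treat the content file as untrusted input and refuse it, rather than dereferencing whatever it claims to contain:

        /* Sketch of the "handle it soundly" half of the fix: a channel
         * file is untrusted input, so reject it instead of crashing.
         * Structures and names are made up for illustration. */
        #include <stddef.h>
        #include <stdint.h>

        #define EXPECTED_FIELDS 4

        struct channel_entry {
            uint32_t field_count;
            uint32_t fields[EXPECTED_FIELDS];
        };

        /* Returns 0 on success, -1 if the file must be ignored. */
        int apply_channel_entry(const uint8_t *buf, size_t len)
        {
            if (buf == NULL || len < sizeof(struct channel_entry))
                return -1;                 /* truncated or missing: skip it */

            const struct channel_entry *e = (const struct channel_entry *)buf;
            if (e->field_count > EXPECTED_FIELDS)
                return -1;                 /* more fields than we know about: skip it */

            for (uint32_t i = 0; i < e->field_count; i++) {
                /* ... evaluate e->fields[i]; bounds already checked ... */
            }
            return 0;
        }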

    1. Strahd Ivarius Silver badge
      Coat

      Re: Wise man say measure twice cut once

      It is still forbidden to shoot the CEO, isn't it?

  25. Anonymous Coward
    Anonymous Coward

    What is old becomes new again

    Not quite a graybeard, but I am getting there...

    This whole debacle is history repeating itself. Seasoned endpoint security vendors (Trend, McAfee, Symantec, etc.) all went through the same sorts of growing pains back in the late 1990s. The lessons CrowdStrike is now learning (like staggered updates) were also learned the hard way by those vendors decades ago.

    CrowdStrike simply doesn't have that history in its DNA to know where all the pitfalls lie.

    1. sgj100

      Re: What is old becomes new again

      But they do! Crowdstrike's CEO, George Kurtz, was the CTO of McAfee in 2010 when McAfee did something similar!

      1. Dwarf

        Re: What is old becomes new again

        You expect a CEO to understand the detail?

      2. anonymous boring coward Silver badge

        Re: What is old becomes new again

        The generally hated McAfee that McAfee himself ridiculed?

      3. hayzoos

        Re: What is old becomes new again

        Lemme guess, gave the same order then as now.

        Publish first at all cost.

  26. sgj100

    Fuzzing

    The preliminary incident report from Crowdstrike says that in the future they will be adding fuzzing to their testing process. Why the **** wasn't this already the case?

    1. anonymous boring coward Silver badge

      Re: Fuzzing

      If that's their solution I suggest not using their products any longer.

  27. Boris the Cockroach Silver badge
    Unhappy

    Get a job in

    QA, if anyone is employed to do it anymore.

    We have mission-critical stuff deployed all the time. OK, it's not to 8.5 million PCs, but the results for us could be devastating, or even lethal.

    That's why we test our programs on the CAD/CAM while programming, why we test the output of that on simulation software, and why we take great care in making sure the highly expensive machinery does what we tell it to. Only then do we unleash the machinery to produce 500 widgets an hour (or whatever we're making). It's also why anyone making changes has to have a rollback position and the knowledge to ensure that he/she has not made a booboo such as releasing an untested config file (or ramming a 2" diameter drill into a chuck running at 3,000 RPM... makes for a diaper-changing day, that does, for all the staff).

    And the aerospace stuff... you don't even want to see the mountain of paperwork we have to fill out to do that. Test, test and test again to prove we've done our job right...

    If only the anti-virus companies (and indeed most software companies) were held to those sorts of standards...

  28. dmacleo

    Uber Eats

    https://techcrunch.com/2024/07/24/crowdstrike-offers-a-10-apology-gift-card-to-say-sorry-for-outage/

    On Wednesday, some of the people who posted about the gift card said that when they went to redeem the offer, they got an error message saying the voucher had been canceled. When TechCrunch checked the voucher, the Uber Eats page provided an error message that said the gift card “has been canceled by the issuing party and is no longer valid.”

    nice....

  29. Snowy Silver badge
    Holmes

    Friday, why on a Friday?

    What puzzled me was they released it on a Friday.

    Microsoft have Patch Tuesday: do the checks on Monday, release on Tuesday, then you have the rest of the week to fix the problems.

  30. pip25
    Stop

    I don't care

    Stop feeding kernel mode code with data downloaded from the Internet, you idiots.

    We can't rely on a vendor not screwing something up (as evidenced by this debacle), but all validation and certification from third parties is useless if the validated code can be reconfigured dynamically by crap downloaded from somewhere. This needs to stop. Now.

    1. JRStern Bronze badge

      Re: I don't care

      The battle between static and dynamic goes on forever, LOL.

      Soon as you lock something tight, someone installs a back door ...

    2. anonymous boring coward Silver badge

      Re: I don't care

      This. And any extensions, no matter what ring they run in, mustn't be able to make the system unbootable.

  31. Ace2 Silver badge

    This kernel module is loading something from the filesystem. So we know it can, and that presumably the directory is trusted.

    Why not:

    1. Write out canary.txt - “I’m loading channel file X”

    2. Load channel file

    3. Delete canary.txt

    If the kmod loads and finds a canary.txt, don’t load the channel file it lists!

    There, global catastrophe prevented.
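
    As a rough sketch of those three steps (hypothetical file names and helpers, written as user-mode stand-ins for what would really be kernel code):

        /* Sketch of the canary scheme above. Paths and helpers are made up;
         * a real kernel module would use its own file APIs. */
        #include <stdio.h>
        #include <string.h>

        #define CANARY_PATH "canary.txt"

        /* stand-in for the real (risky) channel-file loader */
        static int load_channel_file(const char *path) { (void)path; return 0; }

        static int load_with_canary(const char *channel_path)
        {
            char prev[256] = "";
            FILE *c = fopen(CANARY_PATH, "r");
            if (c) {
                /* A leftover canary means the last attempt never finished. */
                if (fgets(prev, sizeof prev, c))
                    prev[strcspn(prev, "\n")] = '\0';
                fclose(c);
                if (strcmp(prev, channel_path) == 0)
                    return -1;            /* don't retry the file that killed us last time */
            }

            c = fopen(CANARY_PATH, "w");  /* 1. "I'm loading channel file X" */
            if (!c)
                return -1;
            fprintf(c, "%s\n", channel_path);
            fclose(c);

            int rc = load_channel_file(channel_path);   /* 2. the risky load */

            remove(CANARY_PATH);          /* 3. only reached if we survived */
            return rc;
        }

        int main(void)
        {
            return load_with_canary("example-channel.bin") < 0;
        }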

    1. gnasher729 Silver badge

      The canary.txt: iOS has a feature where you can write some description of your program state before an app exits, and that state is restored when the user launches the app again. If there is a crash while restoring the app state, that saved state is automatically deleted, so the user doesn't end up in a loop where the app instantly crashes on every launch attempt. Same method.

  32. Henry Wertz 1 Gold badge

    2 points

    1) Others already pointed out that they should obviously deploy to test systems,

    2) Blaming bad templates doesn't change that their software apparently flips out when it sees a bad template rather than, you know, not. They should have better failsafes in there!

  33. Andrew_C

    Obviously they had to have cocked up royally for this to happen, but somehow the more I learn about this, the worse CrowdStrike looks.

  34. herberts ghost

    Perhaps include some in-house Azure systems

    CrowdStrike should host servers from Microsoft, AWS and other cloud providers in their internal testing. E.g. disk drive vendors have array controllers from NetApp, Dell, HP, etc. to test drive updates before they are unleashed upon those customers' customers' systems.

    Also, auto-update is antithetical to reliability and security. Hanging back a day and listening for screams from early adopters is a good idea.

  35. anonymous boring coward Silver badge

    It's not the missed issue that's the real problem.

    The real problem is that a bug in a 3rd party add-on can render the machine unbootable.

    MS need to seriously do some thinking here, and force the anti-virus vendors to use a different approach.

  36. Richard Pennington 1
    FAIL

    Congratulations on your new job!

    Congratulations on your new position as a Beta Tester for CrowdStrike.

    What do you mean, salary? You want to be paid?

  37. iliketech

    No Test environment systems for a sanity check?!

    Staging all software updates to pass through a "Test" environment is basic Quality Assurance procedure. Especially for a kernel driver! For anyone not familiar with this, it would simply be a group of PCs in a lab that receive the updates before a release to customers. If they all go offline, the update is (obviously) a dud.

    This is the software equivalent of the Titan Submersible.

  38. Roland6 Silver badge

    The post doesn't detail the content validator's role

    The post also doesn't detail whether the "Content Validator" is a person sifting through an email inbox; we've just assumed it is an automated process and not a human-mediated workflow...

  39. hayzoos
    Joke

    Automated update distribution

    I thought of a spinoff of the suggestion to test on their own systems. Make sure the distribution system is in the test group. Then a catastrophic crash will render the distribution system unable to distribute the problem update. Problem solved.

  40. flayman

    I suppose we should thank them...

    ...for aptly demonstrating how dependent our IT infrastructures are on trusted vendors, and how vulnerable they are to wild defects. When that trust is misplaced, as was the case here, really bad results can occur. It's something like "who watches the watchmen". The QA processes for an entrusted security vendor need to be far more robust than this episode suggests. I suppose it could have been a lot worse.
