CrowdStrike hires outside security outfits to review troubled Falcon code

CrowdStrike has hired two outside security firms to review its threat-detection suite Falcon that sparked a global IT outage last month – though it may not have an awful lot to find, because CrowdStrike has identified the simple mistake that caused the meltdown. News of the external review emerged in a root causes analysis [ …

  1. Nate Amsden

    what happened

    I think regardless of the bug, the core problem was pushing obviously very untested code/data to all customers' production systems, even customers that had policies regarding staging stuff to non-critical systems first. But even then, this sounds like a bad enough bug that it should have been caught before going anywhere outside of CrowdStrike. Hell, push it to CrowdStrike's own systems first. Not sure how long it took between the update being installed and a BSOD (I have never used CrowdStrike or similar "EDR" tools, as I think they are called).

    1. Apemantus

      Re: what happened

      Surely they should have had initial disposable canary devices? Then ‘special internal human friends’ (also canary devices!). Confirm correct deployment and voila loads of tested BSOD. Astounding.

      1. hoola Silver badge

        Re: what happened

        That needs foresight, costs money and needs real people to interact with them.

        They had an automated system that did all of that "NHI".

        1. NoneSuch Silver badge
          Coat

          Re: what happened

          "the missing 21st parameter – as a pointer, which caused it to touch unallocated memory and ultimately crash the operating system."

          As long as the touching was consensual, my outrage is mollified.

          1. spireite Silver badge

            Re: what happened

            Surely they would have, in that case, experienced an error

            "Overflow error fondled"

        2. TReko Silver badge

          Re: what happened

          Crowdstrike's response sounds like they paid the third party consultants to make them look good.

          The facts are clear: They did not test this particular release before releasing it to the world.

          Sure, they may have tested the sub-components separately, but anyone who's been in IT or engineering knows that one must test the whole friggin' system.

      2. Anonymous Coward

        Re: what happened

        (Over simplified story to protect the ignorant.) We released a patch for our cell phone app. The app allowed staff to talk over WiFi Internet (or cell data) which turned out to be incompatible with EIR back haul routers on the Irish mobile network. This caused grief with a VP on vacation in Ireland who being on that carrier could not call someone.

        He had been on a golf course fairway, could not get through and on his return, tore us a strip. My boss at the time, listened and nodded. VP: "You need to test this stuff before release!" More nodding. My boss was two years from retirement and could comfortably walk out the door anytime, so he had a low GAF rating. We are in North America and have no Irish business presence. It worked perfectly over WiFi so a high-pri fix for the 15th hole was low on his agenda.

        Now, the fix was simple. Ten minute job, once we found the issue in logs and amended the build. The VP was not in our chain of command and no one in the dept liked him. Before release of the next patch build, our boss put in a request for two business air tickets to the same city in Ireland (overseas trips were always upped to business for all staff) and three nights hotel for "Irish mobile network patch testing." Then sent it to the VP for approval (coded to come off the VP's budget, natch) while BCC'ing our dept.

        Denied, of course, but smiles for a week, at least.

    2. HMcG

      Re: what happened

      Beyond that, what really horrified me is what seems to be a complete lack of any crash log checking and safe rollback.

      If you are messing around with your customers' critical systems at a privileged kernel level, there’s an absolute duty to have a watchdog monitor that checks your driver's crash logs and safely rolls back any updates, before any such updates are loaded again.

      There seems to have been a complete lack of any such function. At the very least, I hope that Microsoft revoke their boot-driver flag privilege, as Crowdstrike have not taken their duty to do no harm seriously enough.
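A minimal sketch of the kind of watchdog described above, for illustration only. The class name, the two-crash threshold, and the version strings are all assumptions; nothing here reflects how Falcon or Windows boot drivers actually track crashes.

```python
from dataclasses import dataclass, field


@dataclass
class UpdateWatchdog:
    """Hypothetical crash watchdog: roll back an update that keeps crashing."""
    max_crashes: int = 2            # assumed threshold, not from the article
    active_version: str = "v1"
    last_known_good: str = "v1"
    crash_count: int = 0
    blocked: set = field(default_factory=set)

    def install(self, version):
        """Refuse to load a version that was already rolled back."""
        if version in self.blocked:
            return False
        self.active_version = version
        self.crash_count = 0
        return True

    def record_boot(self, crashed_last_run):
        """Call early at each boot; returns the version that should be loaded."""
        if not crashed_last_run:
            # A clean run promotes the active version to last known good.
            self.last_known_good = self.active_version
            self.crash_count = 0
            return self.active_version
        self.crash_count += 1
        if (self.crash_count >= self.max_crashes
                and self.active_version != self.last_known_good):
            # Repeated crashes: block the update and fall back.
            self.blocked.add(self.active_version)
            self.active_version = self.last_known_good
            self.crash_count = 0
        return self.active_version
```

The point of the `blocked` set is the "before any such updates are loaded again" part: once a version has been rolled back, `install` will not accept it a second time.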

      1. spireite Silver badge

        Re: what happened

        What surprises me is that this suggests a lack of in-depth testing has always existed... in which case they have arguably been damn lucky not to have a meltdown before.

  2. Ace2 Silver badge

    Doesn’t matter. They won’t exist six months from now.

    1. Claptrap314 Silver badge

      What kind of odds are you laying on that? You realize you are betting in favor of the intelligence of management?

  3. TReko Silver badge

    Consultant Jackpot Time?

    One hopes many of these consultants are the US developers they fired in January before moving development to India in February 2024.

    1. Bebu Silver badge
      Coat

      Re: Consultant Jackpot Time?

      the US developers they fired in January before moving development to India in February 2024.

      Another Clown Stroke? What?

    2. cookieMonster Silver badge
      Windows

      Re: Consultant Jackpot Time?

      If this is a direct result of offshoring, then I’m very happy for them. I hope they lose everything.

      1. Anonymous Coward

        Re: Consultant Jackpot Time?

        Offshoring - bane of my life.

        Glossary entry

        Offshoring - A business methodology that values incompetence over expertise.

  4. heyrick Silver badge

    This parameter count mismatch evaded multiple layers of build validation and testing

    How? That's like building a car with only three of the four wheels installed and then saying "but it evaded all of our checks during manufacture"...?

    Also, blindly following a bogus memory address when running into the missing parameter? No input sanitisation? And if it went bang so dramatically in the real world, how could it have gone undetected through all of the validation and testing claimed to have been done?

    1. Bebu Silver badge
      Windows

      Re: This parameter count mismatch evaded multiple layers of build validation and testing

      Fundamentally the Falcon code was far more insecure (and dangerous) than that of the assets it sought to protect (especially in the case of non-Windows systems) and when coupled with poor release and deployment practices a visitation from Mr Cockup was long overdue.

      Like most of the security theatre industry all talk and sans pantalon ou culotte.

    2. Snake Silver badge

      Re: This parameter count mismatch evaded multiple layers of build validation and testing

      From interpreting the timeline, what happened is that:

      - first, they created templates over time that only required 20 input variables

      - second, they provided input files that only provided those 20 inputs

      - finally, they updated the template to require 21 inputs

      Boom, Bob's your uncle: they didn't notice / test the fact that the input file only had 20 variables when they, all of a sudden, hard-coded a demand for 21 for the first time. Memory errors ensued.
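To make the 20-versus-21 mismatch above concrete, here is a rough sketch of the check that was apparently missing. The function, exception, and field names are hypothetical; only the two counts come from the thread.

```python
class ContentMismatch(ValueError):
    """Raised when a content file does not fit the template it claims to fill."""


def bind_inputs(template_fields, instance_values):
    """Pair each template field with its value, refusing any count mismatch."""
    if len(instance_values) != len(template_fields):
        raise ContentMismatch(
            f"template expects {len(template_fields)} inputs, "
            f"file supplies {len(instance_values)}"
        )
    return dict(zip(template_fields, instance_values))


# Hypothetical field names; only the counts (21 vs 20) come from the thread.
TEMPLATE_FIELDS = [f"param_{i}" for i in range(1, 22)]  # 21 fields
```

With this in place, a 20-value file is rejected with a readable error instead of the reader walking off the end of the data to look for a 21st value.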

      1. Shalghar Bronze badge

        Re: This parameter count mismatch evaded multiple layers of build validation and testing

        And neither input sanitizing nor proper input mismatch diagnosis/error handling or fail to safe. Instead, run around out of bounds in memory areas you're not supposed to be. Glorious.

        That begs the question if they at least have some kind of live bit or similar monitoring or if they just sit around and wait for customers to grab the traditional pitchforks and torches.

        At least a nice reminiscence. Back, back in the days, when you tried to read more data than defined, the good old C64 responded with a cheery "?out of data error". Didn't crash the system, though.

        So nice to see that "high tech" and high cost "security" software can't even pop up such a simple message nor get their "testing" to actually test everything they changed.

        1. heyrick Silver badge

          Re: This parameter count mismatch evaded multiple layers of build validation and testing

          "Didn't crash the system, though."

          As I understand it, the errant code was so deeply intertwined with the kernel that it wasn't possible to throw an error. All it could do was shit its pants and wait for mummy, which is arguably better than trying to continue beyond a critical fault. After all, the blue screen is the machine's way of saying "Oh, fuck".

          "Glorious."

          Yeah, rather than sanitise the input and try to fail safe (even if they meant "ignore this and do nothing but warn the user") it went for the epic fail option.

          1. LessWileyCoyote

            Re: This parameter count mismatch evaded multiple layers of build validation and testing

            "As I understand it, the errant code was so deeply intertwined with the kernel that it wasn't possible to throw an error"

            Pure speculation here, based on a completely different OS and architecture, but whatever: one additional factor that might have contributed to the instant disaster is that the uninitialised pointer held the hex value 9c. If you use that as an address, my thought is that it is so low in the memory space it might be something that is only supposed to be used during OS load, and any attempt to access it, even by the kernel, causes an immediate protective stop because it's a "this must never happen" condition.

    3. Duire O'Fender

      Re: This parameter count mismatch evaded multiple layers of build validation and testing

      Clown's Trike only had 2 wheels. What a circus.

  5. 9Rune5

    Wrong experts

    CrowdStrike hires outside security outfits to review troubled Falcon code

    That is very vague.

    They need a bunch of highly experienced kernel driver developers to sort out their mess. A computer that doesn't boot is fairly secure and mission accomplished I guess.

    1. CowHorseFrog Silver badge

      Re: Wrong experts

      If they can't even test a regexp properly you can probably safely assume a large portion of the code should be rewritten because it will be full of many bad things(tm).

    2. Doctor Syntax Silver badge

      Re: Wrong experts

      This involved problems at all sorts of levels coming together. It's not just kernel development skills that are involved. As much as anything it's the manglement decisions that allowed everything to come together this way that need to be looked at.

  6. Will Godfrey Silver badge
    FAIL

    Talk about rookie mistake!

    The project I work with has no security implications at all, yet even that will reject a file if the parameters are missing or out of range.

    1. silent_count

      Re: Talk about rookie mistake!

      In a similar vein...

      I'm currently writing a HTML canvas-based platformer game. It has nothing to do with any kind of critical anything. But the level data is validated, the scripts within the level data are validated, and if anything is not valid the game engine will not run that level. There are also runtime checks so that, for example, if a script tries to spawn an enemy outside the level's bounds, the creature will not be spawned. Nothing is allowed which will put the game into a potentially invalid state. And this is just a dinky game which runs in a browser and has an expected audience in the single digits.

      The problem at CrowdStrike wasn't some specific, unforeseeable glitch at an otherwise well-functioning organisation. The several levels of incompetence displayed could only be the result of a completely incompetent management hierarchy which allowed it to happen. Not only did they fail at their job - preventing bad things - they made things demonstrably worse. Sack the lot!
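The two layers of checking described for the platformer above (validate at load, then guard at runtime) might look roughly like this. The level layout with `width`, `height` and `spawns` keys is invented for the example and is not the commenter's actual format.

```python
def in_bounds(level, x, y):
    """True if (x, y) lies inside the level rectangle."""
    return 0 <= x < level["width"] and 0 <= y < level["height"]


def validate_level(level):
    """Load-time check: return a list of problems; empty means safe to run."""
    problems = [f"missing key: {key}"
                for key in ("width", "height", "spawns") if key not in level]
    if problems:
        return problems
    for i, (x, y) in enumerate(level["spawns"]):
        if not in_bounds(level, x, y):
            problems.append(f"spawn {i} at ({x}, {y}) is outside the level")
    return problems


def spawn_enemy(level, enemies, x, y):
    """Runtime check: refuse a spawn outside the level instead of trusting the script."""
    if not in_bounds(level, x, y):
        return False
    enemies.append((x, y))
    return True
```

The engine would refuse to start any level for which `validate_level` returns a non-empty list, and `spawn_enemy` covers scripts that misbehave after loading.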

  7. CowHorseFrog Silver badge

    You can tell someone is a bad programmer because they use regexps and they of course don't bother to understand what their regexp really does and this time shows yet again ...

    1. Filippo Silver badge

      In this case, though, the error was not in the regexes themselves. It was in the parameter count.

      1. CowHorseFrog Silver badge

        Same shit, they didn't even bother to check the parameters were compatible with what the regexp was actually supposed to do.

  8. Filippo Silver badge

    So there was no validation of input whatsoever? And a bad address exception is left uncaught all the way to system crash? It seems reckless to the point of being weird. I don't do kernel software, and I still wrap a try/catch around single modules in programs that do multiple things, just so that one oopsie doesn't take down everything.

    My gut feeling is that that bit of code is one of those things that are supposed to run really, really fast, every cycle matters, and it was deemed that runtime validation and exception handling could be skipped, since they have full control of the input anyway.

    This can make sense, in some cases, but there's no excuse for the parameter count mismatch. That should have been caught either by static validation, or by running a debug & testing version that does have runtime validation.
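As a sketch of the "static validation" idea above: if the hot path really cannot afford checks, the same check can run once in the build pipeline before anything ships. The function names and the shape of `instances` (a mapping of file name to its list of values) are assumptions made for this example.

```python
def lint_release(template_field_count, instances):
    """Build-time check of every content file against the template it targets."""
    errors = []
    for name, values in sorted(instances.items()):
        if len(values) != template_field_count:
            errors.append(
                f"{name}: has {len(values)} values, "
                f"template needs {template_field_count}"
            )
    return errors


def may_ship(template_field_count, instances):
    """Release gate: ship only when the lint pass finds nothing."""
    return not lint_release(template_field_count, instances)
```

This costs nothing at runtime, which is the trade Filippo describes: skip validation where every cycle matters, but only after something else has proven the input is well formed.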

    1. Joe Dietz

      I write this kind of code for security software. I didn't write crowdstrike, but I did write a competitor or two. Crwd seems to have taken an approach of pre-compiling search trees and directly loading those into their kernel filter driver. Every cycle counts since the job is to eliminate system i/o events against a vast corpus of rules as fast as possible. However, I've always taken the approach of building search trees dynamically directly from the source content on the endpoint since you can do much more validation against what the sensor can actually support and avoid this kind of f'up. The tradeoff is that rules are shipped in source form to the endpoint and are readable by anybody that cares to look. I suspect Crwd was all paranoid about people reading them and decided to ship compiled search trees instead which has some risky edge cases between sensor code and compiled content as aptly demonstrated here.

      1. Doctor Syntax Silver badge

        But does shipping pre-compiled trees preclude checking them as they're loaded?
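One possible answer to the question above, sketched with an invented container format: a precompiled payload can still carry a small header that the loader verifies before touching the contents. The magic value, header layout and entry size are all hypothetical and have nothing to do with CrowdStrike's real channel files.

```python
import struct
import zlib

MAGIC = b"CTNT"                    # hypothetical file signature
HEADER = struct.Struct("<4sHHI")   # magic, format version, entry count, CRC32
ENTRY = struct.Struct("<I")        # each entry is one unsigned 32-bit value


def pack_blob(entries, version=1):
    """Build side: serialise entries behind a self-describing header."""
    payload = b"".join(ENTRY.pack(e) for e in entries)
    return HEADER.pack(MAGIC, version, len(entries), zlib.crc32(payload)) + payload


def load_blob(blob, supported_version=1):
    """Load side: verify everything the header claims before using the payload."""
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    magic, version, count, crc = HEADER.unpack_from(blob)
    payload = blob[HEADER.size:]
    if magic != MAGIC:
        raise ValueError("bad magic")
    if version != supported_version:
        raise ValueError(f"unsupported format version {version}")
    if len(payload) != count * ENTRY.size:
        raise ValueError("entry count does not match payload size")
    if zlib.crc32(payload) != crc:
        raise ValueError("checksum mismatch")
    return [ENTRY.unpack_from(payload, i * ENTRY.size)[0] for i in range(count)]
```

These checks run once per load rather than once per event, so they do not sit on the performance-critical path. A file consisting entirely of zero bytes would fail at the magic check.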

    2. Pascal Monett Silver badge

      Re: since they have full control of the input anyway

      They have full control until some hacker finds a way to make a malformed file input and inject it in the system.

      And since CrowdStrike has demonstrated that that is a valid entry point to madness, . . .

      1. Graham Cobb Silver badge

        Re: since they have full control of the input anyway

        I think this is the main point.

        This whole debacle shows that the way to attack important systems has changed: the easiest point for a successful attack is the security code itself! It is the analogy of the movie strategy of infiltrating your attacker into the big guy's trusted security detail.

        Time for everyone serious about security to leave CrowdStrike and at least move to a competitor whose weak point has not (yet) been so dramatically exposed!

    3. Brave Coward

      " [...] but there's no excuse for the parameter count mismatch. That should have been caught either by static validation, or by running a debug & testing version that does have runtime validation."

      Or by basic documentation.

  9. Maximus Decimus Meridius

    Zero Content File

    So what was the channel file with all zeros then?

    1. maffski

      Re: Zero Content File

      I guess it only had 20 zeros. If there had been 21 it would have been fine.

    2. really_adf

      Re: Zero Content File

      Umm, yeah, the result of a file full of zeroes going (unless I am mistaken) entirely unmentioned is puzzling. It seems almost like the sort of thing a journalist ought to dig into...

  10. mark l 2 Silver badge

    If their software is so badly written that a bad channel file caused it to fall over, and they need outside help to fix it, does anyone have any faith that it's actually going to be any good at stopping malware and viruses at this point?

    1. Doctor Syntax Silver badge

      It sounds as if they've fixed it themselves so were capable of that. Appointing outside experts is more a case of assurance. But (there's always a but) that's only going to be effective if the review is capable of looking at any decisions which underlay the immediate problem and the reviewers are empowered to change things there. For instance, it has been said by several commentards that development was moved to India. If the external reviewers determine that this was a problem then they need to be empowered to say it be brought back in-house or that inspection and testing of the code be improved, and need the power to sign off if and when this has been done to their satisfaction. They also need to be able to look at how that decision was made and ensure that action is taken there if necessary.

      I'm sure their customers will be watching and won't be assured if the review doesn't meet their expectations. Ultimately this sort of external assessment isn't made to fix things for the company. It's made to convince customers that things have been fixed and will stay fixed.

  11. tojb
    Facepalm

    How amazingly commonplace

    Read from uninitialised memory. How very dull, how very heartbleed or a million other bugs based on insecure data structures that don't self-check the size and shape of memory that they are managing.

  12. jcc5169

    Amateurish

    I'm not even a developer, and I know that this would cause a process to blow up .....

  13. Anonymous Coward

    Novel exploitation of named pipes and IPC

    Windows NamedPipes 101 + Privilege Escalation

    “It is possible for the named pipe server to impersonate the named pipe client's security context by leveraging a ImpersonateNamedPipeClient API call which in turn changes the named pipe server's current thread's token with that of the named pipe client's token.”

  14. Plest Silver badge

    Reap what you sow

    Most companies are getting rid of QA depts as they consider them a waste of time and money, they then get the devs to put in basic unit testing, rig up some simple Jenkins automated build tests and then release code that's not been tested properly by proper QA testers who know how testing should be done. QA depts have been slaughtered in the gaming industry in the drive to save costs and increase profits.

    1. CowHorseFrog Silver badge

      Re: Reap what you sow

      QA department is your last line of defence. They should be writing all sorts of tests that would guarantee something as basic as reading a simple file wouldn't blow up.

  15. xyz Silver badge

    So....

    In a nutshell someone counted an array of 20 from 0 and someone else counted to 20 from 1 or vice versa or sideways or whatever.

    How very VB script.

    1. david 12 Silver badge

      Re: So....

      How very VB script.

      Playing to the crowd, are we?

      VB script uses zero-based arrays defined by the maximum index. "5" means "0...5". By this simple means they avoid both the classic beginners 'c' error of expecting an array of [5] to have a '5' element, and the classic 'c' programmers error of expecting a BASIC array to have a '0' element.

  16. JRStern

    Pretty friggin' obvious what it is

    Soon as they said "content file" rather than code file any experienced developer knew what happened.

    Hey, we've all done it, LOL.

    Newbies do it every day.

    Experienced devs shouldn't do it - unless they're rushed or drunk or something.

    Unit test should catch it - IT SHOULD SPECIFICALLY BE TESTED FOR IN UNIT TEST.

    (but failure testing is boring, devs hate it ... but should do it anyway)

    QA should catch it, but QA doesn't always do these "robustness" failure tests, so don't yell at them too much.

    Coding standards for error handling should catch it and make it harmless.

    Yes, kernel programming can't always do that, but if that's CrowdStrike's business they should be able to handle it.

    It took multiple common, stupid errors at once for this to happen, suggesting that at least four major functions in the company have an opportunity for improvement, let's say.
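For what "specifically tested for in unit test" could mean in practice, here is a small sketch using Python's standard `unittest` module. `parse_record` is a toy stand-in written for this example; the field count of 21 is the only detail borrowed from the incident.

```python
import unittest

EXPECTED_FIELDS = 21  # count taken from the thread; the parser is a toy


def parse_record(line):
    """Split a comma-separated record, rejecting the wrong number of fields."""
    fields = line.split(",")
    if len(fields) != EXPECTED_FIELDS:
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return fields


class ParseRecordFailureTests(unittest.TestCase):
    def test_accepts_complete_record(self):
        self.assertEqual(len(parse_record(",".join(["x"] * 21))), 21)

    def test_rejects_short_record(self):
        # The boring failure case: one field too few must be a clean error.
        with self.assertRaises(ValueError):
            parse_record(",".join(["x"] * 20))

    def test_rejects_empty_record(self):
        with self.assertRaises(ValueError):
            parse_record("")
```

The happy-path test is the one everybody writes. The other two are the dull "robustness" tests the comment is talking about.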

    1. Richard 12 Silver badge

      Re: Pretty friggin' obvious what it is

      Simply loading it into a single actual, real system would very likely have shown that there was a problem.

      Any software company that routinely blats out something they've never even installed on anything ever needs to be taken out behind the barn.

      1. Duire O'Fender

        Re: Pretty friggin' obvious what it is

        Does anyone else notice a distant echo from NASA's Challenger experience? "It worked ok last time" is not an excuse. Richard Feynman put it better.

  17. TimMaher Silver badge
    Coat

    They didn’t test it.

    Nuff said.

  18. JLV
    Unhappy

    As the wags say: A programmer has a problem. He thinks: "I know, I'll use regexes!". Now he's got two problems.

    Might be unfair to regexes, in this case. They just missed testing a specific edge case.

    Wouldn't a fuzzing-based approach have caught it? Throw all the data permutations allowed by a function signature and observe what happens.
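A toy illustration of the fuzzing idea above, not a claim about what would have worked on the real sensor. `fragile_match` is a deliberately buggy stand-in that assumes a 21st value always exists; the fuzzer simply throws random-length inputs at it and records anything other than a clean rejection.

```python
import random


def fragile_match(values):
    """Toy stand-in with a bug: assumes a 21st value always exists."""
    return values[20] == "*"


def fuzz(target, runs=500, seed=0):
    """Feed random-length inputs to target and collect unexpected exceptions."""
    rng = random.Random(seed)  # fixed seed so any failure is reproducible
    failures = []
    for _ in range(runs):
        n = rng.randint(0, 25)
        values = [rng.choice(["*", "a", ""]) for _ in range(n)]
        try:
            target(values)
        except ValueError:
            pass  # a deliberate, clean rejection is acceptable
        except Exception as exc:
            failures.append((n, type(exc).__name__))
    return failures
```

Running `fuzz(fragile_match)` reports `IndexError` for every generated input shorter than 21 values, which is the same class of mistake as the one under discussion.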

    That, and, repeat after me: staggered releases, staggered releases, staggered releases....
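Staggered releases can be sketched in a few lines. The ring names, the `deploy` and `healthy` callbacks, and the zero-tolerance failure threshold are assumptions for the example, not a description of any vendor's pipeline.

```python
def staged_rollout(rings, deploy, healthy, max_failure_rate=0.0):
    """Deploy ring by ring, halting as soon as one ring looks unhealthy.

    rings   -- ordered list of (ring_name, [hosts]), smallest blast radius first
    deploy  -- callable that pushes the update to one host
    healthy -- callable that returns True if a host survived the update
    """
    completed = []
    for name, hosts in rings:
        for host in hosts:
            deploy(host)
        failed = [host for host in hosts if not healthy(host)]
        if hosts and len(failed) / len(hosts) > max_failure_rate:
            # Stop here: later (larger) rings never receive the update.
            return {"halted_at": name, "completed": completed, "failed": failed}
        completed.append(name)
    return {"halted_at": None, "completed": completed, "failed": []}
```

With rings ordered as internal machines, then canaries, then everyone else, an update that blue-screens every host stops at the first ring and never reaches customers.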

  19. Anonymous Coward

    Open the floodgates to legal claims

    Define grossly negligent without defining grossly negligent.

    Will be hard to wriggle out of a basic mistake and demonstrate that you were diligent in your work.

  20. boatsman
    Coat

    forgot to check for "do not use a null pointer" in kernel mode. that is extremely stupid.

    nothing to add to that...
