CrowdStrike's Blue Screen blunder: Could eBPF have saved the day?

The CrowdStrike chaos was caused by software running riot in the Windows kernel after an update tripped up the code. eBPF is a useful tool for kernel tracing and observability, but could it have mitigated the CrowdStrike incident? "It's interesting," Tom Wilkie, CTO of observability specialist Grafana Labs tells The Register …

  1. Doctor Syntax Silver badge

    "So, in this particular case, we had a configuration change, which is like there's no code, it's just a config that the sensor consumes. And we went through a validation process and we validated all those. They actually worked. The problem is we had 21 of them and the sensor understood 20. And that's the simple explanation of what happened."

    The problem wasn't that they had 21 and the sensor only understood 20.

    The problem is that the sensor didn't handle being presented with 21 safely. And the more generic version of that is that anything that goes into a kernel has to be very, very good at dealing with the unexpected because there's no safety net. That applies whether it's a malware sensor, a driver, a binary blob, eBPF or anything else.

    1. Will Godfrey Silver badge

      Exactly

      Just how crap does the parser in 'security' code have to be to not check the number of parameters - preferably before even attempting to use any of the contents?
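
The check described above (count the fields before touching any of them) can be sketched roughly as follows. This is a minimal illustration only, not CrowdStrike's actual parser or file format: the comma-separated record layout and the names `EXPECTED_FIELDS`, `ConfigError`, `parse_record` and `load_config` are all hypothetical.

```python
# Hypothetical: the number of fields this consumer understands.
EXPECTED_FIELDS = 20


class ConfigError(ValueError):
    """Raised when a config record cannot be consumed safely."""


def parse_record(line, expected=EXPECTED_FIELDS):
    """Split one record and reject it unless the field count matches."""
    fields = line.split(",")
    # Validate the count before any field is read or indexed.
    if len(fields) != expected:
        raise ConfigError(f"expected {expected} fields, got {len(fields)}")
    return fields


def load_config(lines, fallback):
    """Return the parsed records, or the last known-good config on any error."""
    try:
        return [parse_record(line) for line in lines]
    except ConfigError:
        # Fail safe: keep running on the old config rather than crashing.
        return fallback
```

The design choice worth noting is that a malformed update is rejected as a whole and the previous configuration stays in force, so an unexpected 21st field degrades to "no update applied" instead of a crash.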

  2. BJC

    Would eBPF really have helped here?

    As I understand the issue (mainly from the Dave Plummer YT video), the AV vendor used a kernel driver. The AV vendor marked the kernel driver as critical (i.e. stop the OS if the driver fails). Presumably this option was used because the AV vendor deemed the protection critical, to avoid potential infection when the AV protection isn't available. The AV-supplied driver failed, apparently through poorly coded input validation. The OS detected the driver failure, identified that the driver was marked critical, and stopped.

    If that is correct, the kernel detected the failure but the configuration of the driver (by the AV vendor) caused it to stop. How would eBPF help with that?

    1. Anonymous Coward

      Re: Would eBPF really have helped here?

      Why would it? This was an ad for Grafana...

      1. BJC

        Re: Would eBPF really have helped here?

        Betteridge's law of headlines

      2. Gene Cash Silver badge

        Re: Would eBPF really have helped here?

        Not really. If you actually read the article, the CTO of Grafana himself says:

        "Now, could you catch something like the CrowdStrike incident with eBPF? Yes. Probably. But honestly, you could also catch it just by doing better testing, and that would be my advice."

      3. diodesign (Written by Reg staff) Silver badge

        Just no.

        If we interview someone, it's not an ad. It's an interview. Ads are labeled as such, they're handled by our publisher's non-journos, and we charge a healthy amount of money for them, thank you very much.

        We don't give ads away for free.

        C.

        1. sedregj Bronze badge

          Re: Just no.

          "We don't give ads away for free."

          Quite right, that's a form of corporate suicide for any dog eared rag 8)

    2. Joe Dietz

      Re: Would eBPF really have helped here?

      Security software is just like this. It doesn't matter whether it's in the kernel or runs in userspace (well, actually it matters quite a lot: userspace is NOT effective and has terrible performance and compatibility problems). But by its very nature security software MUST be able to modify the running system... which means if it f's up, your system may not run anymore. This time it was a BSOD in their driver due to some quality issues, but simply blocking particular processes due to a false detection would have the exact same result in Windows (a BSOD, and a looping one at that).

      The actual _issue_ here is that McAfee did all this in 2007, 2009 etc. and Kurtz was CTO at McAfee... and apparently did not take those lessons to his new organization. Security software must be built and supported with a culture of obsessive safety and not just from 'attackers'. Which means testing... but also looking at designs critically and avoiding ones that are going to fail in interesting ways. Culture comes from the top.

      1. Anonymous Coward

        Re: Would eBPF really have helped here?

        While your valid but very generalized point covers the real issue, that things like CrowdStrike jam their fingers deep into the guts of the host operating system, it leaves out much of the eBPF part of the article.

        eBPF also provides a deep lens into the internal state of kernels that support it. It also has had a long legacy of problems, as it was yet another example of Lennart Poettering once again cowboy coding an under-planned and potentially dangerous new feature. It, like much of his work, was slapped together, poorly architected, and promptly produced the security and stability problems people had warned him about from the moment he publicly announced what he was working on. Setting aside that the actual eBPF implementation is incompatible with the BPF structure and syntax, and is basically just trading off the familiar name and co-opting it, a tool to investigate more systems in the kernel than USB and network traffic fills a need.

        But as the article indicates, the security guarantees are inadequate, and Poettering tends to react to all feedback that could be construed as criticism, even attempts at constructive criticism, as if he is being trolled. So instead of working with the community to architect a safer, stable tool that provided at least a compatibility layer with the actual BPF, he applied his own band-aid to the most glaring problems, then moved on to breaking something else. No change there.

        So I'd say the interviewee missed that opportunity. The author could have prompted them with a question, but wasn't required to, and justly might have wanted to leave Lennart's name out of it to cut down on moderation headaches later. Dio is one of the Vultures who responds most often, so we know Dio reads the posts.

        So to close, I would say that security software is often the cause of problems in its own right, but the answer isn't for us to whine about it; it's to pressure the OS makers, security companies, and open source community to start tackling the root issues with the tools head on. An improved eBPF, or a replacement for it, would be a handy tool. Too often an obviously bad but popular tool breaks things at scale these days, and security companies are some of the worst offenders when it comes to QA and code quality.

        That is not an inevitable outcome; kernels tackle thornier problems fairly gracefully. But it will take making it more of a priority and allowing changes to be made.

  3. Nate Amsden

    200+ data sources

    for Grafana, doesn't sound like much...

    I've been using LogicMonitor since 2014 due to its large data source support, and they claim 3000+ integrations.

    I remember having talks with Datadog a few years ago. They were convinced they could replace LM for me, but after some talks it was quite a joke what they were offering (they were suggesting I use their generic SNMP templates to build everything manually). My previous org used the open source Grafana for a while (for some things, while most core infrastructure was monitored by LM), though I never touched it; it was too complex for me to get interested. LM is super simple and powerful (and I've added tons of custom stuff that they don't support). Though they do sometimes change their UI around (they are in the midst of that now), which is SUPER ANNOYING.

    LM isn't cheap though; I'm sure it's more expensive than the Grafana cloud stuff. LM itself is hosted by them (I think partly in public cloud, partly in their own colo(s)); you can't self-host. Though even if they did allow self-hosting I may not want it, as I suspect it's too complex and buggy to run myself (I have said this before: I suspect many SaaS apps are just buggy messes that the vendors host themselves because it's easier to have their own staff manage them than to make the software stable enough for customers to run; the services revenue doesn't hurt either).
