It worked on my machine!
A sanity test of actually installing it on a few real machines before deploying it worldwide to 8.5M machines is something that used to be standard QA practice.
Crowdstrike's practices sound like criminal negligence.
CrowdStrike has blamed a bug in its own test software for the mass-crash-event it caused last week. A Wednesday update to its remediation guide added a preliminary post incident review (PIR) that offers the antivirus maker's view of how it brought down 8.5 million Windows boxes. The explanation opens by detailing that …
The difficulty with staggered releases is that there is urgency to push out updates for security software. Who wants to explain to their customers that they had an update that would have prevented the breach they suffered, but had not given it to them beforehand for ... reasons? I surely don't want to be on the list for "Not-so Rapid Response Content" for my security software!
> Maybe Crowdstrike should release their code to their *own* desktops and servers, an hour or so before releasing it to the rest of the world.
Why do people comment without any understanding at all?
CrowdStrike did not release any code. It was configuration, not code.
It really doesn't matter. They should test their releases, no matter what those releases contain, before going public with them. Whether that is code, or configuration, or some other category of data doesn't really matter. If the behavior of the system has changed slightly, testing is required.
Yes, depending on our definitions, we can argue that it's not code. After all, if someone's counting lines of code from me, they usually wouldn't count the lines of the json file I've just written. But when my program will do something different when the json file is different, it can have the same damaging effects as if I changed what we typically call code, and it therefore needs the same kind of testing.
Criminal negligence is a good term for what went down. I genuinely hope the justice system sees it that way as well. Either way, tons of businesses lost a ton of money, and if CrowdStrike doesn't cough it up, then, unfortunately and inevitably, the customers will. That is not acceptable.
This also demonstrates that contracts don't really help that much in terms of damage. Let's imagine you can get money back from CrowdStrike, it is unlikely to cover the actual damage caused and in some cases perhaps your business is already dead anyway.
Let's imagine you're using a cloud service for your core IT and they "do a CrowdStrike", going down for several days. For many companies that's game over, and compensation from a contract which may or may not be honoured isn't going to help.
How would you put that into practice for what happened here?
Use multiple different security vendors staggered across your redundant server infrastructure so that any cluster-type service maintains quorum, in case one of them throws a Crowdstrike?
(That means at least 3 different vendors, licensing, training, managing).
Make sure critical staff has access to at least 2 different workstations not sharing a single unique component of the critical software stack?
Protecting yourself against every imaginable incident means eventually you're just juggling hundreds of tiny little different baskets!
>Use multiple different security vendors
We got hit by a ransomware again.
But I thought we had state-of-the-art security?
We do, 49% of the machines have Crowdstrike, 49% have MS-defender and one machine runs "Super Number One Pyongyang Security - Great Leader Edition".
So which one got ransom-wared ?
Any damage would need to be pursued independent of any contract anyway. As a software vendor, it'll have clauses in there covering faulty software anyway. The most you could hope to achieve from the contract would be service credits against the outage.
I can't comment directly on CrowdStrike as I haven't seen that contract (or my own company's contract with them).
However, any outage should have been managed as a major incident internally, with the various DR and business continuity plans kicked off. This would be covered in any risk register.
> However, any outage should have been managed as a major incident internally, with the various DR and business continuity plans kicked off. This would be covered in any risk register.
But how do you stop your DR environment from getting affected? Do you run no security software, or run security software from a different vendor?
The Crowdstrike software license limits their liability to the software fees paid.
Computers are general purpose devices and can be put to many different uses. No software vendor is going to refund consequential losses while charging a standard software fee unless they can control very specifically what you do with their software and thus the risk they're exposed to.
I had a family member who was a lawyer tell me that (in the US at least) "you can't sign away your right to sue for negligence". I was asking about the waivers like those added to contracts for risky activities. He explained that you can ALWAYS sue for negligence. I would think in this case with Crowd Strike, it could certainly be argued that this was negligence.
Also, remember that a number of the customers that were hit hard will have well funded legal departments. I really don't see this ending well for Crowd Strike.
I would add that deploying a ring 0/kernel level driver that takes input from a regularly updated content file and which does not perform sanity checking on that input file is also criminally negligent.
Even given their dodgy/insufficient testing processes, this whole mess could have been avoided if the driver validated the content file before attempting to execute it...
Yeah, you would expect the Validator to at least verify that the file it's validating actually contains data which conforms to the format specific to the design of that template, rejecting anything that isn't correct. That's also ignoring the requirement that the service running on the server performs input validation on any file it's ingesting, although that might not be possible due to the way Windows works. Either way, it's a massive fail.
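For illustration, here is a minimal sketch of that kind of up-front structural check. The real channel-file format isn't public, so the header layout, names and constants below are all invented; the point is only that the driver rejects anything malformed before it tries to interpret it.

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Hypothetical content-file header; the real layout is not public.
    struct ContentHeader {
        uint32_t magic;        // fixed tag identifying the file type
        uint32_t version;      // format version the interpreter understands
        uint32_t entry_count;  // number of fixed-size entries that follow
    };

    constexpr uint32_t kExpectedMagic = 0xC0FFEE01;  // made-up value
    constexpr uint32_t kMaxVersion    = 3;           // made-up limit
    constexpr std::size_t kEntrySize  = 64;          // made-up entry size

    // Reject anything that is not structurally sound *before* interpreting it.
    bool ValidateContentFile(const std::vector<uint8_t>& file) {
        if (file.size() < sizeof(ContentHeader)) return false;

        ContentHeader hdr;
        std::memcpy(&hdr, file.data(), sizeof(hdr));

        if (hdr.magic != kExpectedMagic) return false;
        if (hdr.version > kMaxVersion)   return false;

        // The declared entry count must actually fit inside the file, so later
        // parsing can never read past the end of the buffer.
        const std::size_t payload = file.size() - sizeof(ContentHeader);
        if (hdr.entry_count > payload / kEntrySize) return false;

        return true;
    }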
They may sound like criminal negligence but I think we'll find out they were perfectly legal, covered by the terms of use, and Crowdstrike's software will continue to enjoy immunity from liability along with everybody else's.
And, if you are going to blame someone, you must include Microsoft for giving third-party code this kind of privilege.
No, that plays to the Microsoft canard of blaming an open market for security software for the catastrophe. That is in turn a not-so-subtle return to the position where MS grant themselves monopoly privileges when writing new software (because only they have complete access to the system APIs). We were arguing that one in the mid-90s.
I would point out that no Crowdstrike customer was deceived into installing the software, and I would expect that all of them fully accepted that they were granting significant system trust to Falcon. What no-one expected was the shocking lack of software quality that allowed a poorly written _data_ update to crash the software and the machines it ran on. Blaming an automatic content validation tool is no excuse; that approach would not prevent an attack by poisoning the data files after validation.
Crowdstrike need to fix that flaw before I would trust Falcon on my kit.
Shhh... Don't mention the fact that people are angry at Apple for protecting people from themselves. No matter how much this is actually needed.
I agree that Apple can be a pain in the butt, but people are amazingly good at screwing up their security... or reliability.
I suspect you're referring to the "release early, release often" and "move fast and break things" approaches that have become popular in the last few decades. While often associated with agile development, these have nothing directly to do with it. "Bleeding edge" releases for things like BSD have been there for years, with the idea being that they do get tested in the real world, but anyone who puts them in production is ignoring all the warnings and deserves to lose their job. "Move fast and break things" is the mantra that Silicon Valley adopted for building on the MVP (minimum viable product) approach, which puts gaining market share as quickly as possible above all other priorities, including software quality. "Agile" was supposed to fix problems magically if they occurred, to meet the few regulatory requirements and avoid any liability. This worked well enough that the few spectacular failures were deemed acceptable, and it has gone on to pervert many industries where the maxim "failure is not an option" should always be observed.
If this were agile, it would be easy to rollback to the previous version. QA certainly fucked up but this is just as likely to happen with other methodologies as we see regularly with other software providers.
"If this were agile, it would be easy to rollback to the previous version."
And from their side, it was. It didn't take them very long to release the version that patched this. The only trouble was that the buggy version damaged things so badly that the users couldn't revert as quickly as the release process could.
Two things: testing was inadequate; and once again Microsoft's original sin in letting this kind of driver take the system down so effectively. It's not as if the system doesn't have the ability to create save points for rollbacks to preserve integrity.
And it really is Microsoft's failure and the IT monoculture that are at fault here. If Microsoft does not provide better system integrity then repeats and exploits are only a matter of time. Customers should demand better but can only do so if they can make credible threats to move their custom elsewhere.
I recall, from some time ago at my first job (so early '90s), talking to my manager about test rigs. He talked about how much effort, for a safety-critical system, was put into testing the test rig. The test rig itself was relatively simple, but validating that it covered all necessary good and bad cases for the test target was substantial.
Something you are pushing updates to at a rapid pace needs more than just a few tick boxes.
It isn't as if CI/CD pipelines running multiple sets of tests using all sorts of software (for example on Windows Server 2022) are hard to come by (the VM images for these are freely available).
Underlying organisational culture about "updates faster, respond fast, Fast, FAST" is likely the real problem.
They may have ‘done automated testing’ … but did they actually test on any real-world kit representative of what a customer has - did they put it on some PCs, servers, registers, signage, kiosks, self-checkouts, embedded devices etc, virtual or otherwise - and do normal things like reboot them?
Test kit costs money obvs ….
That seems completely unmentioned….
This revelation just sums up the insanity of where we are.
So much is now reliant on software to test stuff with simulations or whatever shite it does that the fundamental concept of actually TESTING something in a live environment has gone.
This is not an oversight or anything like that, it is a disaster that has been waiting to happen. Now it has happened however nothing will change because it is a cultural issue. Too many just will not believe that the old fashioned way of actually installing something to see if it works is ultimately better.
This is because it is considered 'legacy'......
Utter morons.
I'd say you can blame Microsoft twice for this.
First for creating an OS, and selling it via questionable means, that despite actual terabytes of updates still by default resembles a colander from a security perspective and thus requires all sorts of shoring up with IT plasters and bandages to keep it together; and next for sacking their testers and making it acceptable to push shockingly shoddy, not-even-beta-quality code out without any apology or shame, and so making it acceptable for other organisations to do the same.
The problem is that it has never had any real consequences for Microsoft (except with us, but we're such a tiny exception it doesn't even register) - they still get paid. As long as that does not change I do not expect any improvement any time soon.
This WILL happen again.
But it isn't normally difficult to capture and handle unexpected exceptions:
try {
    instantiateTemplate();                    // the work that might throw
}
catch (...) {                                 // catch anything unexpected
    handleBadThingsThatShouldNotHappen();
}
Making sure that the exception handler can't throw an exception, of course.
PS - anyone know how to stop the html code blocks on here from adding space around newlines and removing leading spaces?
You're thinking about user code written in a high-level language, where there is a safety net in the kernel to catch your screwup and gracefully return an exception. If you're running as a kernel driver you will be writing in low-level C and possibly assembler: no safety net, no exception handlers, just a hard crash. Made even worse by insisting on running at first boot - total borkage.
That's not really true. Structured Exception Handling works absolutely fine in the Windows kernel. In fact if you write a driver you may well be more exposed to exceptions and end up with a greater appreciation of how important they are for control-flow at the CPU level. They for sure shouldn't be used to catch screw-ups though - that goes for all exceptions doesn't it?
That said, native C++ exceptions don't work. Python exceptions don't work (duh)... etc
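For anyone who hasn't met it, this is roughly what SEH looks like from C/C++ with the Microsoft compiler - a user-mode sketch, not kernel code, and (as the replies below argue) not something to lean on to paper over pointer bugs:

    #include <windows.h>
    #include <cstdio>

    int main() {
        int* volatile p = nullptr;
        __try {
            *p = 42;  // access violation
        }
        __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                      ? EXCEPTION_EXECUTE_HANDLER
                      : EXCEPTION_CONTINUE_SEARCH) {
            std::printf("caught EXCEPTION_ACCESS_VIOLATION via SEH\n");
        }
        return 0;
    }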
The Windows kernel is written in C. C itself doesn't have exceptions at all. Although Windows does have an exceptions mechanism that's kludged in there.
But, that's the wrong solution.
In C, you can write to a wild (invalid) pointer, and that might be caught by the OS or might just write to a random bit of RAM. In the kernel, you can corrupt any RAM that way, causing some other part of the system to go wrong (perhaps much later) in an unpredictable way.
So, you absolutely have to write your code correctly so it doesn't try to write to an invalid pointer. This is not optional. If you're doing that, then you can use the same techniques to make sure you don't read from an invalid pointer.
And once you've done that, you don't need to try to catch exceptions from using invalid pointers. And you shouldn't even try, because there is nothing sensible you can do if you catch one.
If you're a C# or Java programmer, then you might not have come across the concept of invalid pointers. One of the big improvements in those languages, is that they ensure that pointers are valid. They don't have raw pointers, instead they wrap them in object references and arrays. That makes this entire class of bugs impossible.
Rust also makes this class of bugs impossible, which is why the Linux kernel is introducing Rust for some parts. (Both Java and C# use a "garbage collector", which does not fit in an existing kernel easily. Rust doesn't, which makes it a better fit for gradually converting past of an existing kernel to a safer language).
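To make the contrast concrete, here's a small C++ illustration of checked versus unchecked access - it isn't the Java/C#/Rust mechanism itself, just the same idea expressed in one language: the unchecked form can silently corrupt memory, while the checked form turns the bad index into a detectable error at the point of use.

    #include <cstdio>
    #include <stdexcept>
    #include <vector>

    int main() {
        std::vector<int> table(4, 0);
        std::size_t index = 100;  // imagine this came from an untrusted data file

        // Unchecked: table[index] = 1; would be undefined behaviour - in a kernel
        // driver the equivalent raw-pointer write could silently corrupt unrelated
        // memory long before anything visibly crashes.

        // Checked: the error is detected right here instead.
        try {
            table.at(index) = 1;
        } catch (const std::out_of_range&) {
            std::puts("rejected out-of-range index from the data file");
        }
        return 0;
    }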
Alternatively... leave the kernel well alone? This kinda reminds me of a debate from late last century on my degree, in which Z was inflicted on us to learn formal methods and software assurance. But proof in Z meant not much when we then had to hack away in C or assembler and hope the compiler didn't have any bugs on top of the ones we were writing. Which seems to be the problem, i.e. the kernel is the core of the OS, so if you want all the cruft that's wrapped around it to have a chance of working, it should be left to its own devices. Which I guess has been the problem, i.e. the pressure to allow hooks into the kernel.
Windows drivers are able to handle a number of exceptions, and I have used them when writing kernel mode drivers for high speed (at the time) comms interfaces.
Sure, writing through invalid pointers can happen, but page protection should be in place to stop that affecting resources that are not allocated to the current thread of execution. It is, of course, much better (and harder) to ensure that invalid pointers cannot be formed.
This is a driver that will be intercepting calls to the Windows kernel from any thread, so thread level protections don't save you.
The plan for recovery from a bad pointer dereference in the kernel is:
1) Hope it gets detected by accessing an invalid page.
2) Assume that something is already corrupt. Crash the computer (BSOD) to prevent things getting worse and destroying more data.
3) Reboot. Hope that fixes it for long enough for a human (or script or automatic update) to replace the faulty driver.
4) If stuck in a reboot loop, automatically disable all the non-essential drivers. Hope that the system boots that way. Then a human (or script or automatic update) can replace the faulty driver.
5) If still stuck in a reboot loop, it's going to need human intervention at the console.
In this case, step 4 would have saved the day, but it failed because CrowdStrike decided their driver was "essential".
I noticed recently (after an update to Fedora 40) that the current C compiler versions are now silently adding a zero byte to the end of string variables - something that Java has done for years, so this makes a nice improvement. A quick scan of the C compiler's man page shows that the Gnu C compiler now accepts options to control this feature (-Wno-stringop-overflow -Wno-stringop-overread and -Wno-stringop-truncation), so this behaviour is now the default.
However, I don't recall seeing any announcements about this, so did I miss them, or was this feature just quietly slipped in?
> I noticed recently (after an update to Fedora 40) that the current C compiler versions are now silently adding a zero byte to the end of string variables
I don't understand what you're asking. C doesn't have a "string variable" type. If you mean string literals then the C compiler has added a '\0' automatically since forever.
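A two-line demonstration of that long-standing behaviour:

    #include <cstdio>

    int main() {
        char s[] = "abc";                  // the compiler appends a trailing '\0'
        std::printf("%zu\n", sizeof(s));   // prints 4, not 3
        return 0;
    }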
Slight nitpick, the Windows kernel is written in a mix of C and C++. However, it's mostly C. Drivers are frequently written in C++, but certain features - including normal exception handling - are unavailable.
As for Java programmers not knowing what pointers are, the dreaded NullPointerException means we know of them. Modern Java encourages the use of Optional for return values that may be "empty", making it clearer and handling of them more explicit.
“Rust makes that class of bug impossible” - no, it makes it impossible for it to go undetected. So you tried to access memory that you should never, ever access. Rust can’t fix that. It can make sure the error is detected. What then? You don’t have code to handle it, because it’s not supposed to happen.
It’s the kind of thing where a safe language will crash your app to make sure nothing worse happens. For many apps, crashing is quite harmless. In this situation, it’s the worst thing you could do (but there is nothing else either). So Rust wouldn’t help.
By 'detected' I think you're alluding to bounds checking, which is somewhat different to an unsafe language throwing an Access Violation/segfault only when it hits memory that isn't even there. In the latter case you have no idea if the buggy kernel driver stomped all over important OS data before causing the exception. With Rust bounds checking there may be some logic error further upstream but your memory structures remain uncorrupted. Having a 'wild index' into your container is very much less dangerous than having a 'wild pointer' into raw memory, which can't happen in safe Rust code.
So it is feasible to recover by restarting/reinitialising the driver and let the kernel carry on unharmed, though that depends on expecting the unexpected and coding that behaviour. I think the accepted wisdom is to try to produce driver code which can never panic - that is all potential bounds checks and other panic situations have explicit handling (eg. for a typical container access you simply use .get() and deal with a potential None return if your index is out-of-bounds, rather than [] indexing which panics). That's quite daunting to achieve for anything complex.
"Rust also makes this class of bugs impossible, which is why the Linux kernel is introducing Rust for some parts"
I don't think so. The quote I recall from Linus is a little more sanguine, something along the lines of 'it probably won't hurt to have Rust code possible in the kernel'.
I do not recall any "roadmap" or project justifications for adding Rust support.
Who tests the tester?
This is a bit silly really. You could have tests for the tester, then tests for the tester tester, then tests for the tester tester tester. This could go on forever. What happened to having a team to test, rather than relying on what we now know to be faulty test automation? How can you even have test automation when, as in this instance, the fault was unknown to the test automation and so never got flagged as a fault? I think it boils down to the age-old adage of paying money and ways to avoid it. Why have actual testers when we can do it without them, or with fewer, for less? I can even imagine the meeting they had at some point where they talked about test automation and the money they could save by laying off staff. Pats on the back all round, chaps.
> Who tests the tester?
Richard Feynman wrote some good stuff on testing. Take 2 teams, one to make it, one to break it. It's one of those things where subconscious biases can affect things. I design something to the best of my ability, work through a bunch of failure scenarios and pass it off to another team, who promptly think of something I didn't think of and break it.
> This is a bit silly really. You could have tests for the tester then tests for the tester tester then tests for the tester tester tester.
Yep. But break it down with your trusty Occam's Razor and you get a faulty o-ring. Or in this case, assuming a simulated test was a real test. It sounds like there wasn't actually any pre-deployment test of letting the update loose on a bunch of test environments and seeing what happened. Around 8.5M systems found that out the hard way.
It's funny, but when I was working as a data analyst I used to ask my boss to sanity check my work, and he used to say to me all the time, "why don't you do it?" To which I would reply: I've already done that and checked it 3 times, with as many tests and reconciliations of the data as I can think of, but I wrote it so I think it's right anyway. He got the idea in the end, and on the very rare occasion there was a cockup it was caught.
As with this and pre-deployment tests I'm going to assume they run the software on their own systems so you would think they would at least let it loose on their own systems first after all the testing is done before firing it out to the world. Like a final sanity check but I guess not.
v1.0 - a two-position key-operated switch. The positions are "OFF" and "ON". A childhood friend's father had a then-old Chevrolet truck with this type. You stepped down on, and held down, a separate floor-mounted normally-open switch to run the starter motor.
v2.0 - a four-position key-operated switch. The positions are named (left-to-right for dashboard-mounted sub-types, and stern-to-bow for steering-column-mounted sub-types) "OFF", "ACC", "RUN", and "START". "ACC" is for "accessories". Turn the key to START and hold it there for as long as you want the starter motor to crank. Releasing the key spring-returns the key to the RUN position.
v3.0 - a normally-open, dashboard-mounted, or touchscreen-implemented push switch. Press it and hold it down to crank the starter motor. This switch has no effect unless the car's RF key fob (or a reasonable clone thereof) is "in range".
the line "worked in March", is a bit of a bell ringer. Why think it would still work now ?
I also wonder what "break" testing they did. I see it so often with testing: things are tested to show they work as expected, so something that should not happen (such as a button going missing) gets missed.
However, reading a config file of any type without internal validation is really rather worrying, though I imagine the excuse is "to keep it as small as possible", which would not be the first time I've heard that excuse, but looking at the size of numerous files it is complete bollocks.
That doesn't necessarily preclude signing. Step 1 makes a file. Somehow, the output was corrupted during or after step 1. The file is then passed to step 2, which signs it. The result is a file that is properly signed, verifies just fine, and once the signed content is extracted, it's still invalid. Steps 1 and 2 don't even have to be separate processes, though they probably are.
There is a very good explanation for this.
You have a data file that crashes crowdstrike while booting. There is a new data file available that doesn’t do that. Every reboot you have a tiny, tiny chance that the new file gets downloaded before the crash. 15 reboots and you apparently have a reasonable chance.
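To put rough numbers on it (purely illustrative - nobody outside CrowdStrike knows the real per-boot odds): if each boot gave, say, a 10% chance of pulling the replacement file down before the driver loaded, then 15 boots would give a 1 - 0.9^15 ≈ 79% chance of at least one clean download.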
Big customers, who stand to lose much, could send a small sample of their machine park to DownStrike, so that new updates can be tested prior to the full roll-out.
Evil Marketing guy: That's such a nice operation, you have running. It would be such a shame if some bad update would happen to it.
As others have noted, nothing beats real machine testing. I tested some code once (not my code) on about 5 machines and it would work, or it fell over, or it would work etc. I nearly got done for sabotage... Turns out the previous testing had been done on vms and what I'd exposed was a showstopper and my execution was cancelled.
It is interesting to know the reason CrowdStrike missed the bug in its software, but it does not matter that much. Bugs are going to happen. The real problem illustrated by this disaster is the fact that organizations like Delta Airlines allow automatic updates to their systems without testing them first. I understand that inside places like Delta Airlines people are desperately searching for someone to blame. Placing blame will not prevent this sort of mass outage from happening in the future. Switching to another OS or another programming language will not prevent it either. What is needed are operational changes to how software is deployed. Automatic updates to systems that require continuous uptime is the root problem.
Unfortunately, CEOs get rewarded for taking bold risks rather than being prudently cautious. The soon-to-be ex-CEO of Boeing is a good example. Under his direction, Boeing became less of an "airplane manufacturer" and more of a "profitable corporation". This was a risk, but in the short term increased profits were almost guaranteed. Planes fell out of the sky. The CEO was "punished" with a 33 million dollar payout package. As long as CEOs of major corporations continue to be thus incentivized, future computer mass outages are assured.
How can we trust that this software doesn't mess up things in more subtle ways?
It runs in kernel space - what about potential bugs that may make subtle corruptions in memory, that adversely affect data in other programs but go unnoticed?
I'm not comfortable with the thought of third party modules having this sort of access without any of the scrutiny that should be applied.
From reading CrowdStrike's explanation, it seems to me there were four main issues:
1) The content interpreter (running on 8.5 million Windows endpoints) can render the machine unusable when it reads invalid data in an IPC Template Instance - this indicates a QA problem in a critical software component and is quite concerning.
2) CrowdStrike's policy is that IPC Template Instances can be rolled out to 8.5 million endpoints without any testing, as long as they pass the checks in the Content Validator. This appears to reveal a staggering degree of complacency by management.
3) The Content Validator contained a bug which allowed invalid IPC Template Instances to pass the tests - this indicates a QA problem in this software component, which wouldn't normally be considered critical, except for the policy of not requiring any other testing.
4) Someone created an invalid IPC Template Instance and submitted it for checking and release. This is an everyday type of mistake which should have been caught by QA processes and tools and, as a last resort, by validation within the content interpreter.
There were other things which could have been done to reduce the impact, such as allowing customers to know what was in each channel file and decide for themselves how they wanted to deploy them but, given that this wasn't part of the business agreement, I think the four points above are the main failings which led to this catastrophe.
I don't have any connection with CrowdStrike, either as an employee/contractor or as a customer/user/victim. I also wasn't significantly inconvenienced by it, as I wasn't planning to use a plane, train, doctor or any of the other services which were affected.
> I suppose the wrong hundred thousand machines could wreak havoc while ten million personal devices could be an inconvenience.
A key point here is that CrowdStrike is only installed (barring a few home users with more money than sense - or a "borrowed" work key) by companies, and generally larger ones at that.
So personal devices (barring ...) were never at risk from this cockup - instead it was going to be machines that stopped one part of a (big) company doing something, which stopped their colleagues doing something else which...
Crowdstrike has done some absolutely brilliant marketing and sales targeting C Suite plonkers and best of all Cyber Insurance outfits. It is the latter that is the secondary cause of this and is why they are so endemic with their snake oil shite. Carbon Black is a similar product in a similar situation.
We had a huge panic over a malware/encryption/compromised client due to user stupidity. We had to install Carbon Black on all our servers as part of the response from the insurers. This incurred no exceptions anywhere, so we were now at huge risk of the backups being duffed up by the very product trying to find malware. Eventually common sense prevailed and a compromise was reached.
Nothing was found after weeks or maybe months.
These teams run by insurers (mostly highly paid but fundamentally non-technical consultants) drive a lot of this crap and make things worse. But, and this is the killer, to get the insurance you have to use approved tools, in the event of an incident you have to use approved tools, and that means anything that is sold as not being "legacy".
And that is the snake oil, anything now that is "cloud native" is seen as infallible and the traditional products as legacy even if the latter works better.
As an aside, both of the cloud products I reference could not even detect an EICAR file or a test malicious script inserted into a web page.
Both were reliably detected by our existing "legacy" product we were under pressure to replace. How do I know the new product actually works? The answer was that you did not, unless it detected something.......
For a European company I work with, only the people who were travelling in Asia at that time were hit.
And of course also the poor IT manager who got a wake-up call at 06:30, started her computer to join the meeting and saw it crash a few minutes later...
People who started their computer after 08:30 didn't get hit.
All servers running 24/7 were down however.
It seems the Crowdstrike programmers haven't heard of this gem (or the boss didn't allow it):
Parse, don't validate[1]. Summary: Parsing is the act of going from less typed data into more typed data (or an error). After (proper) parsing you know that the data is valid!
It ties in with Langsec[2]. Summary: LangSec states that every input to a program is actually a (little) language. That's best handled with a parser. See point 1 why.
Failing to do proper parsing demonstrably leads to --->
1: https://news.ycombinator.com/item?id=35053118
2: https://langsec.org
> Parse, don't validate
No.
Parse *then* validate.
> After (proper) parsing you know that the data is valid!
Successful parsing says the data is grammatically well-formed, *not* that it is valid.
my_age = 264
may well parse, but it ain't valid. Heck, depending upon the grammar,
my_age = purple
may parse quite happily (I am, of course, a super-intelligent shade of the colour blue, not purple in the least).
You can add semantic checks into the erstwhile "parser" code, but unless you have put those into the grammar (e.g. there is only a fixed set of colours my age could be) then you are just mashing up the terms used to describe what your code is doing.
In particular, validation of data can (often does) involve cross-checking with other data, which need not have gone anywhere near your parser.
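A toy sketch of that separation in C++, using the age example above (nothing to do with channel files): the parser only establishes that the text is a well-formed number and produces a typed value; the semantic check on that value is a separate, later step.

    #include <charconv>
    #include <optional>
    #include <string_view>

    struct Person {
        unsigned age;
    };

    // Parsing: turn untyped text into a typed value, or fail.
    std::optional<Person> parse_person(std::string_view text) {
        unsigned age = 0;
        auto [ptr, ec] = std::from_chars(text.data(), text.data() + text.size(), age);
        if (ec != std::errc{} || ptr != text.data() + text.size())
            return std::nullopt;   // "purple" is rejected here: not well-formed
        return Person{age};
    }

    // Validation: semantic checks on the already-typed value.
    bool validate_person(const Person& p) {
        return p.age <= 130;       // 264 parses fine but fails here
    }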
Bug in a validator not validating things is one thing.
But surely some developer who created that file must have tried it out and must have checked that it does what it is supposed to do?
I read it was supposed to prevent some use of named pipes by malware. So I would have expected that some developer set up some malware, checked that it successfully used these named pipes, implemented the change, and verified that the malware now failed to use these named pipes. And while testing this they would have noticed a crash during boot.
So apparently they tried to fix a problem, and the developer couldn’t even be bothered to try out that the problem was fixed?
> So I would have expected that some developer set up some malware, checked that it successfully used these named pipes, implemented the change, and verified that the malware now failed to use these named pipes. And while testing this they would have noticed a crash during boot.
Go back over your scenario and compare it to Joe Bloggs's PC on Friday morning, as it BSODs.
What is different?
Joe's machine does *not* have any pipe-using malware.
So, how about the dev *didn't* notice a crash during boot, because it didn't crash, but spotted the malware and dealt with it. Maybe even called over the PHB to demonstrate the positive case. Job done, sign off, release update.
Whoops, tested the true positive, demonstrated the clever stuff worked. But forgot to test the negative condition. The one that most Users actually have.
If that happened, it would still be a QA failure, of course, but, be honest, who hasn't forgotten to test the negative condition at least once?
And "once" is the number of times that poor sod of a (hypothetical) Dev would forget.
No, what has been reported was that just _reading_ the new file crashed. It never got far enough to take any action based on the content. Old file: check for condition 1, 2, 3 and 4. New file: check for condition 1, 2, 3, 4 and five. “Five” instead of “5” crashes. Whether any malware tries to do any of these things doesn’t matter.
Well, it is possible the developer checked and it worked fine in the development environment while the live environment used different validators and "only" one of the validators on live choked on the new file. Still a big fail not to at least test it once with the same environment as live (e.g. by rolling out to a single machine) before rolling it out to millions of machines.
@gnasher729 "So apparently they tried to fix a problem, and the developer couldn’t even be bothered to try out that the problem was fixed?"
I'm not saying you are wrong, but while the release that was deployed to the channel was certainly not tested to see if the problem was fixed, that does not mean the developer did not test that the problem was fixed.
What the developer tests is not what's deployed for release. The developer tests and then checks their work in. What comes off the pipeline that deploys for release has to be tested, because there is no guarantee that something has not gone wrong between the developer and what's to be released. We always test what's going to be released, don't we?
I'm guessing there was a difference between the version that was tested and the version that got released. That could happen in a lot of ways. Maybe two changes were merged into this file and building them together makes the bad file. Maybe it had to do with some additional content in the production build which isn't present in the debug build. There are plenty more.
I've seen the latter example from time to time. For instance, a task where two people wrote code. First, my coworker wrote one unit, then sent it to me. I wrote the second unit. In my testing, these units worked just fine together. Correct results, no crashes, positive and negative results handled as expected. Build it for production and the automated tests freak out instantly. The reason: my debug build was writing more to the log file in case anything went wrong. That slowed things down slightly, which was enough to prevent the race condition in the two processes from going wrong. Take out the logging and the processes might have a concurrency problem and fail. But it worked just fine on my machine. Probably it would have failed eventually if I ran it with the extra logging enough times, but it didn't in the maybe thirty runs I actually did.
No - it's a great example of new usage of an existing word - how language changes over time! We all know exactly what the user means! Language purists screw themselves up on things like this all the time. Language is an ever-evolving thing, subject to whatever new variations of it get introduced as seems fit. We now barely understand what English speakers of ancient times meant by what they wrote!
Not quite a graybeard, but I am getting there...
This whole debacle is history repeating itself. Seasoned endpoint security (Trend, McAfee, Symantec, etc) all went through the same sorts of growing pains back in the late 1990's. The lessons CrowdStrike is now learning (like staggered updates) were also learned the hard way by those vendors decades ago.
CrowdStrike simply doesn't have that history in its DNA to know where all the pitfalls lie.
QA, if anyone is employed to do it anymore.
We have mission-critical stuff deployed all the time; OK, it's not to 8.5 million PCs, but the results for us could be devastating, or even lethal.
That's why we test our programs on the CAD/CAM while programming, why we test the output of that on simulation software, why we take great care in making sure the highly expensive machinery does what we tell it to, and only then do we unleash the machinery to produce 500 widgets an hour (or whatever we're making), and why anyone making changes has to have a rollback position and the knowledge to ensure that he/she has not made a booboo such as releasing an untested config file (or ramming a 2" diameter drill into a chuck running at 3000 RPM.... makes a diaper changing day that does... for all the staff).
And the aerospace stuff........ you don't even want to see the mountain of paperwork we have to fill out to do that... test, test and test again to prove we've done our job right....
If only the anti-virus companies (and indeed most of the software creation companies) were held to those sorts of standards....
https://techcrunch.com/2024/07/24/crowdstrike-offers-a-10-apology-gift-card-to-say-sorry-for-outage/
On Wednesday, some of the people who posted about the gift card said that when they went to redeem the offer, they got an error message saying the voucher had been canceled. When TechCrunch checked the voucher, the Uber Eats page provided an error message that said the gift card “has been canceled by the issuing party and is no longer valid.”
nice....
Stop feeding kernel mode code with data downloaded from the Internet, you idiots.
We can't rely on a vendor not screwing something up (as evidenced by this debacle), but all validation and certification from third parties is useless if the validated code can be reconfigured dynamically by crap downloaded from somewhere. This needs to stop. Now.
This kernel module is loading something from the filesystem. So we know it can, and that presumably the directory is trusted.
Why not:
1. Write out canary.txt - “I’m loading channel file X”
2. Load channel file
3. Delete canary.txt
If the kmod loads and finds a canary.txt, don’t load the channel file it lists!
There, global catastrophe prevented.
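A minimal user-mode sketch of that canary pattern (the filenames are invented, and a real kernel driver would use the native file APIs rather than the standard library, but the logic is the same):

    #include <cstdio>
    #include <filesystem>
    #include <fstream>
    #include <string>

    namespace fs = std::filesystem;

    const fs::path kCanary = "canary.txt";  // invented name, per the steps above

    // Returns true if the channel file was loaded; skips it if a previous
    // attempt apparently died mid-load and left the canary behind.
    bool LoadChannelFileWithCanary(const fs::path& channel_file) {
        if (fs::exists(kCanary)) {
            std::ifstream in(kCanary);
            std::string last;
            std::getline(in, last);
            if (last == channel_file.string()) {
                std::fprintf(stderr, "previous load of %s crashed; skipping\n",
                             last.c_str());
                return false;
            }
        }

        {   // 1. Write the canary before touching the channel file.
            std::ofstream out(kCanary, std::ios::trunc);
            out << channel_file.string() << '\n';
        }

        // 2. Load/parse the channel file here (omitted). If this step crashes
        //    the machine, the canary survives the reboot.

        // 3. Loaded successfully: remove the canary.
        fs::remove(kCanary);
        return true;
    }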
The canary.txt: iOS has a feature where you can write some description of your program state before an app exits, and that state is restored when the user launches the app again. If there is a crash while restoring the app state, that saved state is automatically deleted, so the user doesn't end up in a loop where the app instantly crashes on every launch attempt. The canary.txt idea uses the same method.
CrowdStrike should host servers from Microsoft, AWS and other cloud providers in their internal testing. E.g. disk drive vendors have array controllers from NetApp, Dell, HP, etc. to test drive updates before they are unleashed upon those customers' customers' systems.
Also, auto-update is antithetical to reliability and security. Hanging back a day and listening for screams from early adopters is a good idea.
Staging all software updates to pass through a “Test” environment is basic Quality Assurance procedure. Especially for a kernel driver! For anyone not familiar with this, it would simply be a group of PCs in a lab that receive the updates before a release to customers. If they all go offline, the update is (obviously) a dud.
This is the software equivalent of the Titan Submersible.
...for aptly demonstrating how dependent our IT infrastructures are on trusted vendors, and how vulnerable they are to wild defects. When that trust is misplaced, as was the case here, really bad results can occur. It's something like "who watches the watchmen". The QA processes for an entrusted security vendor need to be far more robust than this episode suggests. I suppose it could have been a lot worse.