How did this happen? It seems to me that it happened because CrowdStrike’s quality engineering and release procedures are nowhere near fit for purpose. But it also happened because Microsoft still only pays lip service to security on Windows. CrowdStrike shouldn’t have released what they did - but equally it shouldn’t have been able to take down Windows.
How did a CrowdStrike file crash millions of Windows computers? We take a closer look at the code
Last week, at 0409 UTC on July 19, 2024, antivirus maker CrowdStrike released an update to its widely used Falcon platform that caused Microsoft Windows machines around the world to crash. The impact was extensive. Supply chain firm Interos estimates 674,620 direct enterprise customer relationships of CrowdStrike and Microsoft …
COMMENTS
-
Tuesday 23rd July 2024 21:51 GMT doublelayer
This article explains, if you didn't already know, why Windows has to go down when code which is running as part of the kernel breaks this badly. Guess what would happen if a kernel module I loaded into Linux, Mac OS, or any other operating system had a memory violation. That's right, it would panic. It is required to panic. If it did not panic, that kernel has a serious reliability problem.
Until people understand that, the attempts to find a reason why Microsoft is to blame here will not work. Maybe you or someone else can actually find a thing that Microsoft should be doing differently related to this, but while people continue to post comments trying to blame it for doing something both standard and necessary, you will fail to make any case because it appears that you have a gap in important systems knowledge.
-
Tuesday 23rd July 2024 23:07 GMT Anonymous Coward
You miss the point. Yes, buggy code that ends up in the kernel will cause crashes.
The issue here is why a vulnerability tool has to go in the kernel. Something like that should only be running in user space: no ifs or buts.
Microsoft is to blame for allowing any old shit to go into the kernel. Just as other OS implementers would be to blame if they did that and exposed their customers to Crowdstrike's fuckup. But they didn't. Microsoft did.
-
Tuesday 23rd July 2024 23:17 GMT doublelayer
Several things in your comment are wrong or misleading:
"The issue here is why a vulnerability tool has to go in the kernel. Something like that should only be running in user space: no ifs or buts."
It goes in the kernel so that it has more visibility and control over what happens. There are some things that can't be done from user space at all, for perfectly good security reasons, and others which can't be done efficiently from there.
Next, the claim that Microsoft is to blame for putting it in. They didn't. CrowdStrike is not a Microsoft product or dependency. People install it. Just as if I write a kernel module, I didn't ask for or get Linus's sign-off before running it. People are able to install things at kernel level, and they make the choice whether to do so or not. It is not Microsoft's decision to permit it, and if it was, we would be rightly complaining about the level of authority they claim to have to make that choice for us. They should not and do not deny people the right to do something potentially damaging with their own computers.
-
Wednesday 24th July 2024 01:09 GMT Anonymous Coward
"It goes in the kernel so that it has more visibility and control over what happens."
Bollocks! One of the principles of security is least privilege. Which IMO means nothing goes in the kernel unless it absolutely must go there. The kernel is no place for a fancy anti-virus tool.
"There are some things that can't be done from user space at all, for perfectly good security reasons, and others which can't be done efficiently from there."
Agreed, Crowdstrike's crap has no good reason to be anywhere near the kernel.
"Next, the Microsoft is to blame for putting it in. They didn't."
I never said they did that. They provided the means to let hazardous code get into the kernel. In the same sense as cigarette manufacturers provide the means for people to inhale hazardous chemicals and get lung cancer.
If Microsoft cared about their reputation - no sniggering at the back! - they would have some way of validating, auditing and testing third-party code that wanted to go in their kernels. Either they didn't have that or their checks went wrong by giving a free pass to Crowdstrike's shit. Which means Microsoft are partially to blame for the recent global meltdown. They're not the only ones that have to take the responsibility for that of course. But they have some culpability.
-
Wednesday 24th July 2024 08:51 GMT joeldillon
' IMO means nothing goes in the kernel unless it absolutely must go there. The kernel is no place for a fancy anti-virus tool.'
Andrew Tanenbaum and the other microkernel advocates lost that argument in the 1990s (to Linus Torvalds among others). Some (many) things do go in the kernel which in theory need not do but for efficiency reasons do.
-
Wednesday 24th July 2024 20:52 GMT BossHobo
This comment had me engaging with chatgpt to learn a little something. Shouldn't today's multiple cores and threaded environments lessen the impact of such context switching? Could we (rather, OS developers) plausibly move drivers back to user space and would it be worth the added stability?
-
Thursday 25th July 2024 10:53 GMT Julz
Spot On
Preemptive execution and huge caches strike again. Given some situations, you can have a lot of rolling back and invalidating of cache lines to do on a context switch which has caused all sorts of design decisions to be made including the one to move away from micro kernels which, I feel, is a poor one. When I was doing such things, I measured Ultra SPARC CPUs using an average of 4 clock cycles to perform a context switch in and out of kernel space. That figure on modern CPUs and kernels is in the order of hundreds and even thousands of clock cycles, and no, they are not clocked that much faster. The quest for straight line CPU speed and marketing bragging rights amongst CPU manufacturers has had many consequences in both the security of and the real world speed of systems.
-
Wednesday 24th July 2024 19:37 GMT Michael Wojcik
Still wrong.
Anti-malware software has to run with kernel or kernel-equivalent privilege, because it has to hook scores of APIs. Do a little research before you post and you won't sound so ignorant. Read things like Marcus Hutchins' blog, for example, and learn how things actually work.
If anti-malware software ran merely as a privileged user-space process, then malware running as a privileged user-space process (which happens all the time on all the major OSes,[1] because of the same architectural issues) could bypass it.
Now, it's conceivable that a microkernel OS designed for security with good privilege separation, like CheriOS (which is built on top of CHERI), might provide a better, less dangerous position for anti-malware software to run in. It's not trivial to figure out exactly where to place it in that architecture, but at least there are more choices. A virtualizing OS like Qubes OS (or even IBM VM and its descendants) might also provide some better options. And, sure, some people will argue that the original NT driver model with the HAL would make this sort of failure less likely. But a complete rearchitecting of Windows would be an infeasibly expensive and risky undertaking at this point.
And Linux and UNIX are no better; they also have drivers in the kernel, and anti-malware there faces the same problem: run in the kernel and risk panics, or run in user-space and risk being bypassed by other privileged user-space processes. (I understand Crowdstrike on Linux, when run in user-space, just compiles a bunch of code using eBPF, which means it still has code running in kernel mode. And if you think eBPF has never been a problem, you aren't paying attention.)
The real fix is to move to an actually modern OS like CheriOS. Good luck with that; some people need to run software they don't control or don't have the luxury of rewriting. In the meantime, rather than complaining, consider running Windows in VMs rather than on the bare metal. There are other host OSes, like, oh, Qubes.
[1] Yes, including e.g. z/OS. Hello, APF-privileged loadlibs!
-
Wednesday 24th July 2024 01:29 GMT DS999
Sorry Microsoft shares a lot of the blame here
It goes in the kernel so that it has more visibility and control over what happens. There are some things that can't be done from user space at all, for perfectly good security reasons, and others which can't be done efficiently from there
Crowdstrike's Linux software comes in two versions, one that has a kernel module like Windows but also one that runs in user space and uses the eBPF (packet filter) facility, which is the one they encourage people to use. If Windows had a similar user space API, the error would have only brought down the Crowdstrike software, not the entire Windows kernel.
macOS goes one better - they have managed to move networking out of the kernel entirely and it runs in user space, so similar software (I'm not sure if Crowdstrike is even available for the Mac) would never have needed to run within a kernel context.
So no, Microsoft did have a choice here. They've supported antivirus applications since the early days of Windows and have had plenty of time to build better ways to interface them with the OS. Instead they've chosen to invest developer time regularly changing the GUI and recently on giving Clippy "AI".
-
Wednesday 24th July 2024 05:24 GMT Dan 55
It goes in the kernel so that it has more visibility and control over what happens. There are some things that can't be done from user space at all, for perfectly good security reasons, and others which can't be done efficiently from there.
Only because MS never made an endpoint security API to make this stuff accessible from userland. Something like this, which Crowdstrike uses on the Mac version.
Please let's not defend Windows' architecture as the way things must be when we have two examples in alternative OSes of how it can be done better.
-
Wednesday 24th July 2024 12:53 GMT Snake
RE: examples that have done better
That's a tall claim! The two OS's that are mentioned to have alternate methods of hooking an AV system into, can you confirm that Microsoft's OS structure actually has that ability without breaking decades of compatibility?
You're talking out of your league and out of your butt. The kernel method of AV integration is how MS's OS's have always worked and we have not been shown that doing otherwise is even an option in their kernel architecture. The OS's mentioned even have user isolation and security that MS doesn't even have, and may not ever be able to have without breaking decades-old software.
-
Wednesday 24th July 2024 13:03 GMT Dan 55
Re: RE: examples that have done better
AVs and endpoint security software by definition have to be up to date. MS could develop an API and make developers transition over simply by not signing off any more WHQL drivers for this kind of software after a certain date. It wouldn't break any other decades-old software, why would it?
So the remaining question is whether it is even an option in their kernel architecture. It should be. Can they make the necessary changes to accomplish this? That's up to MS. If they can't it doesn't bode well for future improvements to Windows that aren't just rejigging the UI.
If it turns out they can't then they should at least make recovery easier. Return F8 and "last known good configuration" to the boot sequence and make recovery from failed driver loading easier and more automatic.
-
Wednesday 24th July 2024 15:25 GMT Tron
Re: RE: examples that have done better
MS should still have had a recovery position, perhaps one that distinguished between stuff in the kernel that was their own/essential, and 3rd party stuff. If you like, a 'safe mode' for kernel crashes. As a previous comment said, they prioritise degrading their OS with worse versions, gimmicks, bloat and restrictions, when we would be happier to have stuck with the same OS for years longer, with better resilience being added.
So some fault to MS, but most to CrowdStrike who could not have done anything like enough testing on their code, or they would have known it was cack.
-
Wednesday 24th July 2024 16:08 GMT doublelayer
Re: RE: examples that have done better
You mean that one mentioned in the article. It might have read something like this:
"The way that it works is that drivers can set a flag called boot-start," he said.
"So normally, if you've got a driver that's acting kind of buggy and causes a failure like this, Windows can auto resume by simply not loading the driver the next time. But if it is set as boot-start, which is supposed to be reserved for critical drivers, like one for your hard drive, Windows will not eliminate that from the startup sequence and will continue to fail over and over and over and over again, which is what we saw with the CrowdStrike failure."
So they have that by default, and it would have done exactly what you describe except that a flag was set specifically to bypass that safety feature. As it says, there's a good reason to allow something to set itself that way, in case this is required for the system to boot correctly anyway.
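The behaviour described in that quote can be sketched roughly as below. This is a toy model in Python, not Windows code: the `Driver` class, its `boot_start` and `failed_last_boot` fields, the driver names, and `drivers_to_load` are all made up for illustration, and the real loader certainly tracks more than this.

```python
from dataclasses import dataclass


@dataclass
class Driver:
    name: str
    boot_start: bool         # hypothetical flag: marked as required for boot
    failed_last_boot: bool   # hypothetical: did it crash the previous boot?


def drivers_to_load(drivers):
    """Return the names of drivers the toy loader would load on the next boot.

    A driver that failed last time is skipped, unless it is marked
    boot-start, in which case it is loaded again regardless.
    """
    selected = []
    for d in drivers:
        if d.failed_last_boot and not d.boot_start:
            continue  # auto-recovery: leave the misbehaving driver out
        selected.append(d.name)
    return selected


drivers = [
    Driver("disk", boot_start=True, failed_last_boot=False),
    Driver("buggy_optional", boot_start=False, failed_last_boot=True),
    Driver("buggy_boot_start", boot_start=True, failed_last_boot=True),
]
print(drivers_to_load(drivers))  # the optional buggy driver is the only one dropped
```

In this model the failing boot-start driver is reloaded on every boot, which is the loop the quote describes.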
-
Wednesday 24th July 2024 19:43 GMT Michael Wojcik
Re: RE: examples that have done better
Kernel-mode anti-malware on Windows typically uses boot-start because otherwise there's a race at startup which malware can exploit. That should be obvious.
So it's a tradeoff between risks, as happens all the time in security. You can argue Crowdstrike made the wrong choice — and I'm not interested in defending them; the Friday incident was a monumental screw-up and they don't deserve any defense — but it's neither arbitrary nor patently incorrect. We've seen plenty of Windows malware delivered as signed drivers (because many OEMs and ISVs are not good at key hygiene), so early loading has justification.
-
Wednesday 24th July 2024 18:09 GMT Anonymous Coward
Re: RE: examples that have done better
Last known good might not work here. Windows wouldn't have known about the update as it didn't go through the Windows install process. This update was done without Windows knowing about it. You would probably have had to go back to the original CrowdStrike app/driver install.
-
Wednesday 24th July 2024 18:18 GMT Someone Else
Re: RE: examples that have done better
The kernel method of AV integration is how MS's OS's have always worked and we have not been shown that doing otherwise is even an option in their kernel architecture.
And, of course, that makes it right and forever immutable.
Bollocks!
What it makes it is hidebound. If Micros~1 could ever be made to get off their thumbs and behind the concept of refactoring (and innovation, but I digress...), perhaps they would figure out a way to eliminate this problem.
"Perhaps", because given their recent track record, it's not clear they could write a "Hello, World" program without fucking it (and/or something else) up.
-
Wednesday 24th July 2024 13:06 GMT Peter Gathercole
Endpoint Security API
This MacOS facility looks a lot like the Auditing API on AIX which has been in use for 30+ years on POWER/PowerPC/Power systems.
The protection it provides depends on the event code that contains instrumentation. I don't know MacOS, but I am somewhat familiar with AIX.
In that OS, pretty much every system call has a block of code that can be configured to drop useful information into a ring buffer, which can be read using a specific system call to allow out-of-kernel code to process the events for auditing purposes. There are selectors that allow events to be not logged, acted on, or logged but ignored, depending on the configuration of the subsystem.
But if MacOS works like AIX, this is not actually enough to act as an AV filter. It does allow you to notice when files are being opened or accessed. It does allow for all sorts of other system events to be recorded (originally, almost all kernel routines were instrumented, but I cannot say that is true for AIX any more). But importantly, for things like file or network read and writes, it does not give you the ability to look inside the data being passed around. It's like it keeps the metadata and a subset of the data itself, but not the whole data, a bit like the phone system that may keep information about who you called and when, but not the actual conversation.
This makes it useful to record (and in some cases act on) certain events, but does not give it the complete access that an in-kernel driver has in a traditional monolithic kernel, where the driver can reach all of the internal buffers and data control structures in the kernel.
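A minimal sketch of that arrangement, in Python purely for readability: instrumented calls drop metadata into a fixed-size ring buffer, and a separate reader drains it. The `AuditRing` class, its method names, and the overwrite-oldest policy are assumptions made for the example, not a description of the actual AIX or macOS interfaces.

```python
from collections import deque


class AuditRing:
    """Toy fixed-size ring buffer of audit events (metadata only)."""

    def __init__(self, capacity):
        # Assumption for this sketch: when full, the oldest entry is overwritten.
        self._events = deque(maxlen=capacity)

    def record(self, syscall, path, nbytes):
        # Only metadata is kept: which call, on what, and how much.
        # The data being read or written is never stored.
        self._events.append({"syscall": syscall, "path": path, "nbytes": nbytes})

    def drain(self):
        """Hand all pending events to the out-of-kernel reader and clear the buffer."""
        events = list(self._events)
        self._events.clear()
        return events


ring = AuditRing(capacity=2)
ring.record("open", "/etc/passwd", 0)
ring.record("read", "/etc/passwd", 512)
ring.record("write", "/tmp/out", 128)  # buffer was full, so the "open" event is lost
pending = ring.drain()
print([e["syscall"] for e in pending])
```

The reader sees who touched what and when, but nothing in an event lets it inspect the bytes that were transferred, which is the limitation described above.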
Someone mentioned eBPF for Linux and Linux-like OS's. This allows you to hang external code written in a p-code on hooks provided in some kernel routines. This can be done dynamically after the OS has been started, and has been criticised for being able to intercept and in some cases modify in-kernel data using code that is not in the kernel itself. I have always worried about this as a feature, because it sounds to me like a way of altering kernel behaviour with little or no oversight. The original BPF code allowed you to put code into the pseudo machine implementing the filters to trigger certain actions like dropping or logging certain specific (originally network) events, but eBPF allows much more access to the data, and works on more than just network events. It probably does allow the type of actions that CrowdStrike want to do, but the code runs in a kernel pseudo machine that can be allowed to crash without affecting the OS.
-
Wednesday 24th July 2024 10:52 GMT andy 103
CrowdStrike is not a Microsoft product or dependency. People install it.
The problem being that people install it at airports, banks, and on other critical infrastructure. Then can't cope when it all falls down.
All from organisations with reams of policies about that not happening, yet allowing it to happen so easily.
-
Wednesday 24th July 2024 16:12 GMT doublelayer
Yes, they could have implemented a two-stage process where they still have a kernel-level program and it provides data out to something else. There might have been an efficiency drop by doing that, but it would probably be fine enough. The critical point, however, is that this change, while it might have prevented this problem, still involves there being code running at kernel level which, if it broke, would break the kernel. The attempts to blame Microsoft often take the form of explaining that CrowdStrike shouldn't have run anything at kernel level at all, which would not work, and then finding a reason why it's Microsoft's fault that they could, which it isn't.
-
Wednesday 24th July 2024 06:20 GMT Adair
Which even if true at an absolute level (moot), doesn't absolve MS from its responsibility to provide an OS that is capable of effectively mitigating such disasters, e.g. rollbacks, immutability, etc.
We're still expected to put up with what amounts to a 'toy OS' for use in frontline services as though that is acceptable and normal.
Smells more like 'profit before people/security'.
-
Wednesday 24th July 2024 15:02 GMT Steve Channell
This article says they couldn't ("https://www.theregister.com/2024/07/22/windows_crowdstrike_kernel_eu/")
There is simply NO EXCUSE for what CrowdStrike has done: they've taken advantage of a signed kernel driver to side-load code into the kernel in contravention of the license agreement. We can expect a future Patch Tuesday to black-list the csagent as malware.
To describe these sys files as "broken configuration file" insults the intelligence of readers : either they include binary executable code, or are virtual environment like https://en.wikipedia.org/wiki/EBPF but inferior.
-
Wednesday 24th July 2024 18:20 GMT Dan 55
The EU only said that whatever MS did for their own endpoint software, they should offer third parties the same access that they have themselves.
MS could have developed a security endpoint API for themselves and third parties but instead they decided to allow kernel access for themselves and 3rd parties via kernel drivers.
That doesn't excuse Crowdstrike's shonky code of course.
-
Wednesday 24th July 2024 19:43 GMT Michael Wojcik
To describe these sys files as "broken configuration file" insults the intelligence of readers : either they include binary executable code, or are virtual environment like https://en.wikipedia.org/wiki/EBPF but inferior.
Eh? That's prima facie incorrect. It is certainly possible for invalid data to trigger a logic bug in the code that interprets it. Indeed, that happens all the time. Have you never heard of fuzz testing?
A trivial example: Read an integer type field from a record, use it (without validation) as an index into a table of addresses, attempt to dereference the retrieved address (perhaps plus some offset). If the record has an invalid type value, you can get precisely the symptoms seen in this case.
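That scenario can be sketched in a few lines. Python is used here only for readability, so the out-of-range lookup raises `IndexError` rather than reading a wild pointer as the equivalent C would. The record layout, handler names, and `dispatch_*` functions are invented for the example and say nothing about how Falcon is actually structured.

```python
def handle_process_event(record):
    return "process:" + record["payload"]


def handle_file_event(record):
    return "file:" + record["payload"]


# Table of handlers indexed by the record's integer type field.
HANDLERS = [handle_process_event, handle_file_event]


def dispatch_unchecked(record):
    # No validation: a type value past the end of the table raises IndexError.
    # In C the equivalent would read past the table and dereference garbage.
    return HANDLERS[record["type"]](record)


def dispatch_checked(record):
    # Validate the field before using it as an index; reject bad records.
    t = record.get("type")
    if not isinstance(t, int) or not 0 <= t < len(HANDLERS):
        return None
    return HANDLERS[t](record)


good = {"type": 1, "payload": "a.txt"}
bad = {"type": 7, "payload": "a.txt"}

print(dispatch_checked(good))
print(dispatch_checked(bad))
try:
    dispatch_unchecked(bad)
except IndexError:
    print("unchecked lookup failed on the invalid record")
```

The code path is identical for both records; only the data differs, and only the unchecked version falls over.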
-
Tuesday 30th July 2024 16:50 GMT Steve Channell
Nope, did you not see the reference to eBPF?
eBPF includes a verifier where shoddy software does not.
As someone who has written kernel-level code, I can assure you that it is possible to write formally provable code using Formal Methods - I don't anymore because the levels of code review, profiling, verification, {unit, system, integration, regression, performance} testing are prohibitive. Duff data is only an issue if you don't verify it.
I know my surveillance code was 100% reliable, not because I'm some kind of genius, or used VDM mathematical proof; it was 100% reliable because it checked every pointer and fell back to a read-discard loop (after WTO instruction) that ensured it did no harm. CrowdStrike's code is surveillance code - their first work-around "fix" was to delete csagent.sys.
To translate your comment: "It is certainly possible for shoddy invalid data to trigger a shoddy logic bug in the shoddy code that interprets it. Indeed, that happens all the time in application code"
-
Wednesday 24th July 2024 19:48 GMT diodesign
'broken configuration file'
At the time config file was the best description we had. This is an evolving saga. Our latest article (linked) gets closer to the specifics, that the channel files customize how templates of code run to detect particular malicious activity.
The file in this case was poorly formed, and caused its interpreter within Falcon to crash. This was missed in the automated testing.
C.
-
Wednesday 24th July 2024 12:54 GMT fg_swe
LSM Linux Security Modules
LSM also uses kernel level code to control and intercept potentially ALL userspace-to-kernel calls.
Yes, some security things must be done in kernel mode. But with that comes extreme duty of diligence by the "plugin" author. A config file error must never generate a bad pointer; it should simply result in said config file being ignored.
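As a rough sketch of "ignore the bad file and carry on", assuming for illustration that the config is JSON with a `version` and a `rules` list (an invented schema; the real channel files are a different format), a loader can validate the whole file before letting it replace the configuration already in use:

```python
import json

REQUIRED_KEYS = {"version", "rules"}  # hypothetical schema for this example


def load_config(text, current):
    """Return a config parsed from `text`, or `current` if `text` is unusable."""
    try:
        candidate = json.loads(text)
    except ValueError:
        return current  # malformed file: ignore it and keep what we have
    if not isinstance(candidate, dict) or not REQUIRED_KEYS.issubset(candidate):
        return current  # wrong shape: ignore it
    if not isinstance(candidate["rules"], list):
        return current  # field has the wrong type: ignore it
    return candidate


active = {"version": 1, "rules": ["block-known-bad"]}
active = load_config("\x00\x00\x00\x00", active)   # garbage is rejected
print(active["version"])
active = load_config('{"version": 2, "rules": []}', active)
print(active["version"])
```

The point of the design is that nothing downstream ever sees a half-valid file: either the new config passes every check, or the previous one stays in force.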
CrowdStrike has very bad quality assurance in place. The government should fine them for negligence.
-
Wednesday 24th July 2024 11:50 GMT gnasher729
If this happened to Apple: 1. Apple would tell crowdstrike “the way you f***** up, we won’t let any crowdstrike code run in the kernel anymore. If that’s a problem, figure it out”. 2. Crowdstrike sells itself to tencent which runs to the EU which fines Apple for being “anticompetitive” and hurting consumers.
-
Wednesday 24th July 2024 05:11 GMT Yorick Hunt
Up to and including XP, you'd be offered "Start Windows with last known good configuration" in the event of a boot sequence failure. In *nix land, Grub keeps copies of old kernels and boot scripts and allows you to similarly revert to something which actually works.
Newer versions of Windows though, following Microsoft's decree of "users are too stupid to manage things," simply steamroll through, no matter what, at full speed towards the introduced brick wall.
Given the number of snapshots being taken left, right and centre, surely it shouldn't be that difficult to implement a failsafe boot process? Or is this going to be their major selling point for "Recall?"
-
Wednesday 24th July 2024 06:06 GMT Anonymous Coward
As far as I recall, all last known good configuration does is use the alternate HKLM control set (there are two that are switched at each boot) so it is unlikely to have helped in this case.
In fact, I cannot recall a single situation where using last known good helped solve issues I was facing.
-
Tuesday 23rd July 2024 21:56 GMT david 12
RedHat had exactly the same problem, with CrowdStrike causing a kernel panic after loading a bad channel file.
If you want a resilient system that can't be taken down by drivers that are marked as required for boot, then you want a different kind of machine architecture, not a different OS.
-
Wednesday 24th July 2024 01:35 GMT Andrew Hodgkinson
RedHat is buggy; that's not a "new architecture" issue
Allowing kernel drivers to fail gracefully is a long-solved problem, but quality engineering is expensive and mainstream vendors are cheap a**holes only interested in shareholder gains. As for this specific RedHat crash - please read:
https://news.ycombinator.com/item?id=41030352
-
Tuesday 23rd July 2024 21:41 GMT Doctor Syntax
Re: Canary releases?
The term's been around for a while, if not in relation to releases, in other contexts. About 10 years ago it became a practice to post assurances that a business had not been served a subpoena by a given date. Failure to update it was an indication that it had received one without breaching any terms the subpoena may have contained forbidding an announcement that it had. The origin, of course, is a comparison with the coal-miner's canary which would be more susceptible to carbon monoxide poisoning than the miner - not a close analogy with the warrant canary but it fits well with a sacrificial S/W instance which can be exposed to a pending update.
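For the software sense of the term, a minimal sketch might look like the following. The function and host names are hypothetical, and a real rollout would also want waiting periods, more than one health signal, and a rollback step, none of which are modelled here.

```python
def staged_rollout(hosts, apply_update, is_healthy, canary_count=1):
    """Apply an update to a few canary hosts first; stop if any of them fails.

    Returns the list of hosts that received the update.
    """
    canaries, rest = hosts[:canary_count], hosts[canary_count:]
    updated = []
    for host in canaries:
        apply_update(host)
        updated.append(host)
        if not is_healthy(host):
            return updated  # a canary died: halt before touching anything else
    for host in rest:
        apply_update(host)
        updated.append(host)
    return updated


state = {}
hosts = ["canary-1", "prod-1", "prod-2"]


def bad_update(host):
    state[host] = "crashed"


def good_update(host):
    state[host] = "ok"


def healthy(host):
    return state.get(host) == "ok"


print(staged_rollout(hosts, bad_update, healthy))   # stops after the canary
print(staged_rollout(hosts, good_update, healthy))  # reaches every host
```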
-
Tuesday 23rd July 2024 23:02 GMT Malcolm Weir
Re: Canary releases?
Where this works (and it doesn't always) is in legal jurisdictions where compelled speech is forbidden or extremely disfavored (e.g. the USA). A court can order someone to NOT say something until the matter has been fully adjudicated, but requiring a statement usually is impossible until after the court has heard, and judged, both sides.
Of course, a government can always request speech ("I counted them all out..."), but the decision to comply has to be voluntary otherwise the speech is compelled.
-
Wednesday 24th July 2024 01:15 GMT that one in the corner
Re: Canary releases?
> I just used to call them phased releases, back in the day.
But were you dealing with code that could totally knacker the machine?
The canary falling off its perch is a good analogy for a BSOD (shortly followed by an explosion - of expletives heard all around the open plan office).
But if your app failing just meant it had to be restarted whilst the rest of the User's tasks progressed as normal - well, "signal the alarm, the canary has a bit of an itchy wing" doesn't have quite the same ring to it.
-
Wednesday 24th July 2024 13:15 GMT fg_swe
Bingo
Cybernetic attackers will analyze patches in order to attack not yet updated systems.
Patches should be thoroughly tested by the authoring company.
Also, they should be distributed/staged encrypted on all affected computers and only after that the key should be broadcast and patch actually applied.
-
Tuesday 23rd July 2024 22:10 GMT Anonymous Coward
So why was table lookup done in pspSystemThread?
As someone who has looked at far too many driver crash dump screens since the mid 1980's (usually my own code) the first thing that jumped out was why the hell was this being done in the main thread in a direct OS call Kernel Space thread and not a worker / auxiliary thread.
Any code called from an OS kernel dispatch / callback should be just doing very light-weight stuff. Unless it is an actual hardware device driver. And even then you try to keep those main call threads call trees as lightweight as possible. All heavy lift code should be on separate threads. Even better - processes. Either in Kernel Space, or much better, User Space. User Space for agents. Always. So that when stuff goes wrong (which it always will) you have at least some chance of handling it gracefully. Without bringing down the whole damn OS. This is the NT kernel after all. Where since 4.0 drivers are now in Ring 0. So no safety net.
At least running agent code in User Space will give you some form of Structured Exception Handling support. Which stops BSODS from happening. Mostly. In Kernel Space there are fewer options but exceptions like page faults can be handled. And have been for many decades. If you know what you are doing.
Then there is the fact that all Kernel Space user application code should be written with asserts everywhere. And I mean everywhere. Every second or third line should be an assert. With the assert code supporting graceful error / failure handling and recovery. And not only checking legal range of everything going into calls but legal range of everything coming out. Just like properly engineered code. In embedded and RTOS software.
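A small sketch of the "check what goes in and what comes out, and recover gracefully" style, written in Python rather than driver C for brevity. The function, the value ranges, and the fallback policy are invented for the example.

```python
def in_range(value, low, high):
    """True if value is an integer within [low, high]."""
    return isinstance(value, int) and low <= value <= high


def scale_priority(raw, fallback=0):
    """Map a raw priority (0..255) to a level (0..7), checking both ends.

    On any violated check the function returns `fallback` instead of
    raising, which is the graceful-recovery part of the idea.
    """
    if not in_range(raw, 0, 255):    # check what comes in
        return fallback
    level = raw // 32
    if not in_range(level, 0, 7):    # check what goes out
        return fallback
    return level


print(scale_priority(200))
print(scale_priority(300))   # out of range on the way in: fallback is used
```

The output check looks redundant here because the arithmetic is trivial; the habit pays off when the computation in the middle is something less obvious, or gets changed later.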
We can blame MS for putting drivers and related code in Ring 0. Since NT 4. And we should. But you cannot blame MS for CrowdStrike's utter technical incompetence. Which is the case here. Just look at the csagent call stack code symbols and offsets in the PAGE FAULT crashes. A single thread call tree with > 3Meg byte code offset values. Anything over a few 100K byte offsets in code like this in the main thread is starting to push your luck. CrowdStrike must have just stuck all the code in a single thread with zero isolation and partitioning. Which MS (and other OS vendors) have been telling device driver writers not to do for the last 30 plus years.
Maybe CrowdStrike should hire some people who have actually read carefully (and understood) the DDK docs. And know how WDM drivers / kernel code etc work on the bare iron. It's not like the relevant kernel source code has not been out in the wild for the last 20 years. If you want to know what is really going on. Because it's pretty obvious that no one currently working for CrowdStrike has a clue about any of this stuff.
-
Tuesday 23rd July 2024 22:16 GMT elDog
Re: So why was table lookup done in pspSystemThread?
Guessing out of hubris and laziness.
Many a neophyte programmer thought it would be much easier to write code to stay in one protection level (kernel) than go through the hoops of having another process handle the real work.
Remember when everything we wrote was at level-0 on the early micro-PCs?
I grew up with IBM-360s and learned to love the GE-600 series master/slave levels. Then they built the Multics machines from whence (somewhat) Unix was spawned. I love having the hardware tell me that I f'd up without having to read through a full kernel dump.
-
Tuesday 23rd July 2024 22:31 GMT Anonymous Coward
Re: So why was table lookup done in pspSystemThread?... it was laziness
Have seen this scenario play out a few times over the years. I bet if you look at the source code for the various components you will find that CrowdStrike just took the MS DDK and SDK docs sample code for drivers and callbacks and just pasted in their own code as one big blob. Copy / Paste / Maybe Test it a bit / Ship It.
Then sooner or later it all blows up and they learn the hard way (if they don't go out of business first) that you cannot just paste code in wherever you think it might fit. The codebase will have to be properly architected into very strongly horizontal and vertical partitioned functional blocks. With very robust error detection and recovery. This is not Win32 application code where you can get away with sloppy code and any old crap. This will have to be embedded software quality code. Which is a whole different ballgame.
So you actually have to hire people who know what they are doing. Not dot com hires fresh out of college with fancy PhD's. Or outsourced to foreign body shops.
-
-
Tuesday 23rd July 2024 22:56 GMT Bebu
Re: So why was table lookup done in pspSystemThread?
Thank you.
Answered a lot of questions for me (not being a windows person ;)
Having observed crowdstrike software under linux I was quite sure MS wasn't the main culprit in this fiasco (for a change.)
Crowdstrike appear to be claiming the offending .sys file(s) were data and not executables which is a little disingenuous. If the data didn't alter the execution of their code then why load it? Possibly more accurate to think of their kernel module as an interpreter running in a kernel context, with these .sys files as its code. A bit like a third rate version of eBPF I imagine.
-
Wednesday 24th July 2024 19:58 GMT Michael Wojcik
Re: So why was table lookup done in pspSystemThread?
Crowdstrike appear to be claiming the offending .sys file(s) were data and not executables which is a little disingenuous. If the data didn't alter the execution of their code then why load it?
Data does alter the execution of code, in the general case. That's the whole point of data. Perhaps you need to review the theory of computation.
Honestly, I cannot understand why some people are making this argument. To the extent that there's any distinction between code and data, this is precisely that distinction. TM is in state A, reads a datum, moves to state B according to the transition for that datum.
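A toy sketch of that point, in Python. The transition table, state names and inputs are all made up for illustration; nothing here reflects how any real driver is written.

```python
# Hypothetical transition table: (current state, datum) -> next state.
TRANSITIONS = {
    ("A", 0): "A",
    ("A", 1): "B",
    ("B", 0): "A",
    ("B", 1): "B",
}


def run(data, state="A"):
    """Feed each datum through the table and return the final state."""
    for datum in data:
        # The datum, not the code, decides which state comes next.
        state = TRANSITIONS[(state, datum)]
    return state
```

The code never changes, yet `run([1])` and `run([1, 0])` end in different states. A datum the table does not cover, such as `2`, raises `KeyError` here; an unchecked lookup in kernel-mode code has no such safety net.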
-
-
Wednesday 24th July 2024 08:05 GMT thosrtanner
Re: So why was table lookup done in pspSystemThread?
Totally agree with most of this, especially the insanity of only using 2 rings - kernel and user. Well, that's not entirely fair, but proper access privileges would help: device driver threads, for example, should have privileges to write to *their* device pages and to read/write user memory WHERE THE USER has given permission (by making a system call asking for memory to be transferred to/from the device). Antivirus software afaics needs even less privilege than that, because, honestly, if your a/v stuff crashes - you need to know, sure, but you can carry on using your system (though disconnecting from the internet would seem a good idea).
But CrowdStrike's code passed WHQL validation. And that is Microsoft's fault. Device drivers that read files off disk and do things with them are not a great idea.
-
-
Tuesday 23rd July 2024 22:13 GMT TReko
You get what you pay for
In February 2024 Crowdstrike had layoffs in the USA and moved most tech jobs to India. They proudly announced this via a press release.
It's the same as Boeing's 737 MAX MCAS software being outsourced to $9 an hour jobs in India.
Unless very carefully managed, the savings are an illusion.
-
Wednesday 24th July 2024 05:36 GMT deadlockvictim
Re: You get what you pay for
I wouldn't diss the Indians.
My experience of them is that they are great programmers, at least the ones I have worked with, who are based in Chennai.
The problem with outsourcing is more a socio-economic one for the country that has lost the jobs than one of the quality of the work done.
-
Wednesday 24th July 2024 11:29 GMT theOtherJT
Re: You get what you pay for
I'm sure that there are a great many truly excellent programmers in India, they're just not working for the $9 an hour outsourcing outfits.
We've got several Indian programmers, some of whom even work from India, at my place of work and as far as I know they're all very good at their jobs. The kicker is that they work for us not some outsourcing agency we hired in to do the job on the cheap.
-
Wednesday 24th July 2024 19:59 GMT Michael Wojcik
Re: You get what you pay for
In February 2024 Crowdstrike had layoffs in the USA and moved most tech jobs to India.
Yes, but currently there's no public evidence this had anything to do with this disaster. The faulty logic may well have been in Crowdstrike's driver before February, and the broken process that allowed the release of the corrupt channel file may well have been in place as well.
There's no shortage of idiots and lazy people in the US. Blaming this on Indians, or on new hires, or whatever, is simple prejudice. It does suggest that Crowdstrike have priorities other than quality (you don't lay off a bunch of developers if you care about quality; that's how you lose institutional memory, among other things), but it is not evidence that the layoffs were in any way the cause of this incident.
-
Wednesday 24th July 2024 01:30 GMT that one in the corner
Gizza job
This whole thing is just Kurtz's SOP for getting his name into the press before leaving for a new job, just like when he left McAfee.
We all curse the very soil he walks upon, but for the money men in smoke-filled rooms: "Kurtz? Kurtz? I've heard that name somewhere, haven't I? Well, guess that means he's famous. Go ahead, lob some money at him, let's see what he can do. Got any more brandy?"
-
-
-
Wednesday 24th July 2024 19:59 GMT Michael Wojcik
Yes, but again, testing would have revealed that. As would an initial internal rollout before pushing it to customers, or a phased rollout, or any of the other things people have been suggesting.
Crowdstrike obviously screwed up badly, because they have bad practices. There's simply no other explanation, other than malice, which they've already disclaimed.
-
-
Wednesday 24th July 2024 07:56 GMT Anonymous Coward
Why was this flagged as boot-start
Surely only ESSENTIAL hardware drivers need that. Basically, ones that are NEEDED for the system to limp up, show an error message and do enough to let the user fix things. Most hardware will run in legacy modes too, so no high performance gaming video drivers, just the basic VGA one would suffice.
-
Wednesday 24th July 2024 12:07 GMT gnasher729
Re: Why was this flagged as boot-start
Obviously if your hard disk driver crashes, it’s game over. All you can do is try again and pray.
A simple technique for the problem here: if you have more than one configuration file, you write “parsing xyz” to the drive when you start parsing it, and remove the message after success. You also check for this message. If you find it before parsing xyz, you write “parsing xyz for the second time”. If you find that, you skip that file with a big error message.
That way you skip only one configuration file and only if it is a real problem.
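A rough sketch of that marker technique, written as user-mode Python rather than driver code. The function name, the marker file and the `parse` callable are all hypothetical; the only assumption is that the marker lives somewhere that survives a crash.

```python
import os


def parse_with_crash_guard(config_path, marker_path, parse):
    """Parse config_path unless earlier attempts at it died twice.

    parse is any callable that takes a path; marker_path is a small
    state file on disk. Both are illustrative names.
    """
    first = f"parsing {config_path}"
    second = f"parsing {config_path} for the second time"

    previous = None
    if os.path.exists(marker_path):
        with open(marker_path, encoding="utf-8") as f:
            previous = f.read().strip()

    if previous == second:
        # Two attempts never reached the cleanup step: skip, loudly.
        print(f"ERROR: skipping {config_path}, it broke the parser twice")
        return None

    # Record the attempt on disk BEFORE touching the file.
    with open(marker_path, "w", encoding="utf-8") as f:
        f.write(second if previous == first else first)
        f.flush()
        os.fsync(f.fileno())

    result = parse(config_path)  # a crash here leaves the marker behind
    os.remove(marker_path)       # success: clear the marker
    return result
```

In this sketch the "second time" marker stays in place after a skip, so the bad file keeps being skipped until something (for example a replacement file arriving) removes the marker.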
-
-
Wednesday 24th July 2024 10:38 GMT ColinPa
Downloading stuff automatically
I believe that one reason why the problem was pervasive is that the code downloads stuff from its server, outside of any change control/freeze/lock down. The same way that antivirus code downloads its stuff. This seems to be a fairly general pattern, which saves the user having to make a decision. For the unattended machines, there is no one to ask, so it is configured to do it automatically.
Trying to manage this model is a nightmare.
1) Change the firewall to disallow access to the server
2) On one day a week, open the path to the server for some of the machines, these machines get updated. Close it again
You cannot have N-1 and N-2 images.
3) Next week, open the path to some other machines, these get updated.
4) etc.
After some weeks all machines should have been updated.
Now manage this with multiple servers downloading fixes automatically.
Of course companies which have processes to distribute fixes and not allow automatic updates, and only allow access to approved sites should be better protected. But this is a lot of work.
One quote I read ... It is expensive to do it right. It is even more expensive to do it wrong.
-
Wednesday 24th July 2024 11:12 GMT Sceptic Tank
Address 0x000000000000009c
My memory is severely rusty on this topic and I don't have the time / inclination to go and study it, but that address sounds suspiciously like it could be a "page zero" access, which is used in x86 / x64 to check for null pointer assignments. The Falcon people (haha "Falcon Heavy Rocket") do waffle about null bytes or something in the channel file.
-
Wednesday 24th July 2024 17:16 GMT Anonymous Coward
Re: Address 0x000000000000009c
That was only one of the crashes. Other crashes had other addresses which were not in page zero. So we think it was an uninitialized pointer, not a null pointer.
But some right-wingers on Twitter said it was a null pointer caused by an incompetent disabled, Black or woman employee. What's wrong with some people :-(
-
Wednesday 24th July 2024 18:18 GMT Anonymous Coward
Re: Address 0x000000000000009c
But some right-wingers on Twitter said it was a null pointer caused by an incompetent disabled, Black or woman employee. What's wrong with some people :-(
I do not know, but apparently they post a lot on social media about how the fact that there are gay people, black people - and even female cartoon characters wearing trousers - means that there's some deep conspiracy to ... I don't know, make people polite or something.
-
-
Sunday 28th July 2024 04:21 GMT LessWileyCoyote
Re: Address 0x000000000000009c
Back in the days when I was programming on mainframes, an address with that many digits was an absolute address, i.e. relative to the total memory space of the machine. Address 9C would have been firmly in what we called "the bottom left-hand corner of the machine", where things like the system clock resided. I have no idea whether that concept translates in any way to the PC world, but I do know that if any process on a mainframe had high enough privileges to access that area, but was not the actual OS, everything stopped. Very quickly.
-
-
Wednesday 24th July 2024 12:17 GMT Andrew Mayo
If only there was a way to check if a memory access would fail
Oh wait.... https://kernelmode.info/forum/viewtopic0aa3.html?t=5317
There's an API for this - MmCopyMemory() - introduced way back in Win 8.1 - that allows driver code to verify that a memory address is valid without triggering a kernel trap.
So any good software engineer would rigorously validate the information in the channel file because - even if these files are signed by Crowdstrike (and I don't know if they *are* signed) - you can't be sure that some malicious actor didn't perhaps manage to perform some kind of supply-chain attack and leveraged a channel file to cause disruption. Obviously in this case it was human error but if the driver code had been written resiliently, this would never have happened.
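To make "rigorously validate" concrete, here is a small user-mode Python sketch. It is not kernel code and has nothing to do with MmCopyMemory itself, and the file layout is entirely made up (the real channel file format is not described in this thread); it only illustrates checking every length and index before using it.

```python
import struct


def load_entries(blob, table_size):
    """Parse a made-up record format and reject anything out of range.

    Hypothetical layout: a little-endian uint32 count, followed by that
    many uint32 indexes into a table that has table_size slots.
    """
    if len(blob) < 4:
        raise ValueError("file too short for header")
    (count,) = struct.unpack_from("<I", blob, 0)
    # Check the declared count against the actual length before trusting it.
    if len(blob) != 4 + 4 * count:
        raise ValueError("declared count does not match file length")
    indexes = struct.unpack(f"<{count}I", blob[4:])
    for i in indexes:
        if i >= table_size:
            raise ValueError(f"index {i} outside table of {table_size} slots")
    return list(indexes)
```

With this layout, a 16-byte file of zero bytes is rejected at the length check, and an index such as 0x9c into a four-slot table is rejected before anything is dereferenced.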
-
Wednesday 24th July 2024 13:10 GMT Seajay
Why did they go EVERYTHING AT ONCE?
Surely a better system for releasing such updates would be a phased release system? You then combine that with a "phone home" to confirm update successful.
i.e. PC requests update, installs it, then confirms back that it's installed and everything is operating normally. Release system monitors the releases going out, and expected confirmations allowing it to quickly shut down a release if there is something amiss...?
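Something like the following Python sketch, where `install(host)` is a stand-in for "push the update and wait for the host to phone home", and the phase sizes and failure threshold are arbitrary illustrative values:

```python
def phased_rollout(hosts, install, phase_sizes=(10, 100, 1000), max_failures=0):
    """Push an update in growing phases; stop when confirmations go missing.

    install(host) is a hypothetical callable that returns True only if the
    host confirmed it is up and healthy after the update.
    Returns (updated_hosts, halted).
    """
    updated = []
    remaining = list(hosts)
    sizes = list(phase_sizes)
    while remaining:
        # Once the listed phases are used up, take everything that is left.
        size = sizes.pop(0) if sizes else len(remaining)
        batch, remaining = remaining[:size], remaining[size:]
        failures = 0
        for host in batch:
            if install(host):
                updated.append(host)
            else:
                failures += 1
        if failures > max_failures:
            return updated, True  # halt: later phases never get the update
    return updated, False
```

With these defaults, an update that breaks every host is attempted on the first 10 machines and then stopped, rather than reaching the whole fleet.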
-
-
Wednesday 24th July 2024 14:46 GMT Paul Anderson
Re: Canary releases? More like guinea pig
In my experience, most phased / canary updates choose their guinea pigs / early adopters arbitrarily. If your number comes up for the first phase of the update release and there is a bug, you're out of luck. You're an unwitting, paying beta tester and your live systems are on the line. Few customers consent to this or are even aware of it. I consider it to be unethical behaviour and I hate it. Worse still, when your systems go down they offer to 'help you' with the problem by getting you to run diagnostics and collect masses of log files. All of which they feed back into their beta testing process.
-
-
Wednesday 24th July 2024 13:23 GMT Andrew Mayo
Windows does of course have ETW
Event Tracing for Windows does provide conceptually similar functionality I think to what Linux and MacOS have. Unfortunately the API for ETW is rather horrible, and this blog post wonderfully explains why.
https://caseymuratori.com/blog_0025
(Ironically, I used Casey's blog post to actually understand how to use the API). But, as mentioned by others, this is not an interception mechanism. If you want to actually stop a blacklisted executable being run, then you need hooks at kernel level to prevent the process actually being created.
For this kind of thing, Microsoft provide the concept of 'filter drivers'. More specifically, minifilter drivers provide a fairly robust mechanism for intercepting all kinds of system operations, including process creation.
https://stackoverflow.com/questions/58420338/intercept-process-access-using-a-windows-minifilter-driver
So in fairness to MS they DO provide a decent set of tooling to hook AV/EDR software into the OS, but developers need to follow guidelines and build stuff carefully of course.
PS: Interestingly, AV/EDR software can cause performance issues due to a phenomenon I call 'resource amplification'. What happens is that with software intercepting low-level functions like registry accesses or process creation, the AV software obviously has to do stuff on each call. This takes CPU resources and in some cases other resources like disk I/O.
Now if the system gets busy, and moves close to the point where it's running out of resource, the AV software can 'amplify' this by becoming busy itself with the flurry of extra calls. This pushes a machine over the edge before it normally would reach saturation, causing significant performance issues. Data exfiltration software is a particular problem because it scans the content of files to ensure they don't contain sensitive information, an intrinsically expensive process.
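A back-of-envelope way to see the effect, as a Python sketch. This is a deliberately crude single-CPU model with invented numbers, not a measurement of any real product:

```python
def utilization(calls_per_sec, base_cost_ms, av_cost_ms=0.0):
    """Fraction of one CPU consumed by the calls (1.0 means saturated)."""
    return calls_per_sec * (base_cost_ms + av_cost_ms) / 1000.0


def saturation_rate(base_cost_ms, av_cost_ms=0.0):
    """Call rate (per second) at which one CPU is fully busy."""
    return 1000.0 / (base_cost_ms + av_cost_ms)
```

If each intercepted call costs 0.5 ms on its own and the AV hook adds another 0.5 ms, `saturation_rate` drops from 2000 to 1000 calls per second, so a burst of 1500 calls per second that would have left headroom (`utilization` 0.75) now overloads the CPU (1.5).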
-
Wednesday 24th July 2024 21:30 GMT Henry Wertz 1
kernel errors
Linux still CAN have a full kernel panic. But I've had faulty drivers log nasty messages about dereferences, null pointers, etc, and the system does carry on; the kind of errors that used to cause a full panic can now just have it basically panic that individual driver and carry on. You know, depending on how important the driver was and if it scribbled over memory or screwed the system up further.
This MIGHT be why CrowdStrike on Linux caused *some* panics and not a panic every time; it may have been logging nasty error messages and keeping the system up the rest of the time.
-
Thursday 25th July 2024 17:44 GMT thatstephen
I think it is obviously a long debate as to whether kernel access is necessary and how Windows can or can't control faulty code working at the kernel level. I think the far more important failure from Crowdstrike here is in the release 'process'.
'Cybersecurity provider', 'faulty patch release', '8 million bricked customers systems' should not be in the same sentence unless 'bankrupt' is also in the sentence.
The issues I have with their process are:
1) Even without an intentional canary release process there must have been some delay in the release process given that it was at Internet scale. The fact that a very large number of systems were hosed before anybody noticed must mean that the release was an automated process that was unattended, or at least very poorly attended. Customers pay a lot of money for services like Crowdstrike. The fact that they can do a release that can do such damage, in such a way that they can't press a big red stop button on it, is not an error but fraud in my mind.
2) They are paid to provide protection from malware bringing down systems. Does none of that 'protection' extend to remediation and bringing systems back up, especially if the thing that brought your customers' systems down is very familiar to your engineers because they made it?
3) Creating a bricked system from a release. How on earth does their patch system provide any protection from supply chain attacks or man in the middle attacks?!
It is also shocking to me that in 2024, when we have virtualization on everything, journaling filesystems, endless disc space to store rollbacks, boot protection at processor level etc. etc., this can happen to 8 million PCs and servers and, more importantly, take a lot of time to remedy.