Whilst I'm sure there are things MS could do to minimize the impact of a repeat ClownStrike performance, let's not forget that ClownStrike made two fundamental errors here: no testing of the files before release and no client-side sanity checking of the data before it was used in the kernel. If ClownStrike had done ONE of those, we wouldn't be in this mess. (And let's not forget "Don't deploy on a Friday" either.)
Post-CrowdStrike, Microsoft to discourage use of kernel drivers by security tools
Microsoft has vowed to reduce cybersecurity vendors' reliance on kernel-mode code, which was at the heart of the CrowdStrike super-snafu this month. Redmond shared a technical incident response write-up on Saturday – titled "Windows Security best practices for integrating and managing security tools" – in which veep for …
COMMENTS
-
-
Monday 29th July 2024 10:08 GMT Evil Auditor
While I agree with you, I'd also question the procedures on the customer side. Mind you, for a very long time I've been far from the front line and have happily dwelled on the umpteenth line of theoretical defense. But why would anyone deploy an update of any kind directly to business-critical production systems?
-
Monday 29th July 2024 11:21 GMT Anonymous Coward
I would theorize:
A critical vulnerability patch is released. The recommendation is to apply it ASAP.
1. You apply the patch ASAP, and risk bringing your systems down with a bad patch.
OR
2. You wait until your next patching cycle. In the meantime, your organization is hacked. Your business is compromised. You have to report it. You have to fix it. You may get a hefty fine.
All down to risk management. How many times have patches been applied without causing an issue, compared with the times they have caused massive disruption?
-
Monday 29th July 2024 11:33 GMT big_D
Exactly, and it is cheaper and easier to reboot a PC into safe mode and remove a dodgy file than to replace thousands of devices, rebuild servers from clean system images, install and configure the software, then restore the data (and only the data) from backups - if they haven't been compromised.
-
Monday 29th July 2024 16:41 GMT sitta_europea
[quote]
I would theorize:
A critical vulnerability patch is released. The recommendation is to apply it ASAP.
1. You apply the patch ASAP, and risk bringing your systems down with a bad patch.
OR
2. You wait until your next patching cycle. In the meantime, your organization is hacked. Your business is compromised. You have to report it. You have to fix it. You may get a hefty fine.
All down to risk management. ...
[/quote]
Weeeeell... sort of, but the thing is I haven't yet seen a figure for the probability that - even when it's working as designed - this CrowdThink software is any better than any of the other packages which aim to detect malware. And the best I've seen still only manages to detect about 85% of the threats even on a good day.
So this kind of argument falls a bit flat. If you're relying entirely on a product like this then in the long run you haven't got a prayer.
My point is that there's really no excuse. For any of them.
-
Tuesday 30th July 2024 00:50 GMT Anonymous Coward
And that is why I said it's risk management.
Would you rather take the risk of not applying the update and being unable to detect a chunk of the threats, compared with systems that have the updates applied? 85% threat detection is better than, say, 60%. How many times have CrowdStrike, for example, released updates that did not cause disruption, compared to the times they have? I wonder what that percentage would be.
By the way, I'm absolutely not excusing the failure in testing - I am merely pondering the question around how fast security related updates should be applied and any reasoning behind it.
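[Ed: the trade-off being pondered above can be made concrete with a toy expected-cost model. All the numbers below are illustrative assumptions, not CrowdStrike statistics.]

```python
def expected_cost(p_bad_event: float, cost_bad_event: float) -> float:
    """Expected cost of a risk: probability of the event times its impact."""
    return p_bad_event * cost_bad_event

# Made-up figures for illustration only:
# patching now risks a rare bad update taking systems down...
cost_patch_now = expected_cost(p_bad_event=1 / 15_000, cost_bad_event=5_000_000)
# ...while waiting a cycle widens the window for an unpatched exploit.
cost_wait = expected_cost(p_bad_event=1 / 2_000, cost_bad_event=2_000_000)

print(f"patch now: {cost_patch_now:,.0f}")
print(f"wait:      {cost_wait:,.0f}")
```

With these invented inputs, waiting is the costlier bet - but the whole argument turns on probabilities that, as the commenters note, nobody publishes.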
-
-
Thursday 22nd August 2024 01:03 GMT RAMChYLD
This is why most IT departments typically have one to a few "canary machines" - test environments used to see whether anything would go wrong before deploying to "production", i.e. enterprise-wide. Untested updates get deployed there first. If the updates don't BSOD the machine within an allotted time period, they go out to everyone else.
Granted, it's not foolproof, but at least it catches and prevents snafus like what happened with CrowdStrike.
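[Ed: the canary gate described above can be sketched in a few lines. The `deploy` and `healthy` callbacks and all the names here are hypothetical placeholders, not any real deployment tool's API.]

```python
import time
from typing import Callable, Iterable

def canary_rollout(
    update: str,
    canaries: Iterable[str],
    fleet: Iterable[str],
    deploy: Callable[[str, str], None],
    healthy: Callable[[str], bool],
    soak_seconds: float = 0.0,
) -> bool:
    """Deploy to canary hosts first; promote to the fleet only if they stay healthy."""
    canaries = list(canaries)
    for host in canaries:
        deploy(update, host)
    time.sleep(soak_seconds)  # the allotted observation window
    if not all(healthy(h) for h in canaries):
        return False  # a canary fell over: halt the rollout
    for host in fleet:
        deploy(update, host)
    return True
```

As the commenter says, not foolproof - a fault that only triggers on certain hardware or data slips past a small canary pool - but it would have caught an update that BSODs every machine it touches.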
-
-
Monday 29th July 2024 11:31 GMT big_D
Because they are business-critical production systems and therefore the prime target for attackers... The rollout of updates happens multiple times a day; most companies just don't have the staff to be constantly testing the definition updates, especially when only one in, say, 15,000 updates has any problem.
Checking that new versions of the driver are reliable can be done, because that probably only gets updated every few months or a couple of times a year, but checking each of those "definition" files, which are essentially code repositories, multiple times a day is beyond the scope of most IT departments. They'd have a huge staff churn, as people would be sick to death of running hundreds of tests, releasing to production, running hundreds of tests, releasing to production... multiple times a day. They'd never have time to do anything else.
Even then, unless you were lucky, your tests probably wouldn't find anything, or you'd be constantly writing new tests as each definition file update would bring new changes to test against...
In an ideal world, yes, those updates would be tested, before being released, in practice, the IT department doesn't have the resources or the money to do it effectively and if they can manage, maybe 2-3 checks a week, they might as well not bother using MDR software in the first place, as it is often protecting against new, active threats that started within the last few hours.
-
Monday 29th July 2024 14:41 GMT Doctor Syntax
The updates are pulled in automatically so don't go through the IT department at all and if they did testing would be automated. Given that testing the specific functionality isn't going to be likely (do you want to keep samples of all the latest malware anywhere on your network?) about the only test is going to be determining whether it falls over or slows the system. Maybe some firewall rules that only allow intermittent access to the update server and not all the fleet at once might be possible, automateable and even allow for the inclusion of a sacrificial box as a trip-wire.
-
-
Monday 29th July 2024 13:43 GMT T. F. M. Reader
@EvilAuditor: why would anyone deploy an update of any kind directly to business-critical production systems?
Customers deployed Falcon agents that include an in-kernel component/module/driver. The agents communicate with a management server that pushes malware signature updates ("templates" in CrowdStrike parlance) and the customer does not have any control over the process. CrowdStrike don't call it a software update, even though it is (clue: there is no difference between programs and data). Once in a while such an update crashes every machine it is pushed to, as it turns out.
Your question, were it not rhetorical, should be directed at Crowdstrike in this case.
-
Monday 29th July 2024 18:28 GMT Ken Hagan
"But why would anyone deploy an update of any kind directly to business-critical production systems?"
As I understand it, the "update" could not be blocked by system owners, since it wasn't considered "code" but merely configuration data.
Therefore, the answer to your question is that the update was deployed to systems critical to someone else's business. That's a much easier decision!
-
Tuesday 30th July 2024 12:30 GMT cdegroot
The whole point of malware detection is to get the detecting code out faster than the baddies can use new avenues of attack. That's why there is a clock ticking and thus that level of automation.
The question is of course why we're collectively at sea in leaky buckets. System design has constantly preferred speed and features over security, we're paying a price for that trade off.
The question is whether that price is (too) high. I'd focus though on, say, the constant effects of ransomware over this very press-friendly single event to decide that.
-
-
-
Monday 29th July 2024 07:22 GMT simonlb
"Windows Security best practices for integrating and managing security tools"
This isn't rocket science, but if you're going to write a kernel driver which parses data from an external file, then you have it check the data is in the correct format and validate each data object as it is read in. You cannot assume the file has the correct data format and values. If anything being read in is out-of-scope or just plain wrong, reject it, flag it up, add an entry to your logfiles and gracefully move on. Do NOT, whatever you do, leave open the possibility of crashing.
That said, my own POV, which I've said before, is that the whole design of Windows is fundamentally flawed, with far too many hacks, kludges and compromises that have been left unaddressed or fudged to 'fix' over the years, and which now are almost impossible to fix without a complete rewrite. I mean, how can plugging in a USB device such as a keyboard, mouse or headset require a reboot of the machine to get it to work once the OS has installed the drivers? I've seen this numerous times over the years, but only on Windows machines, never on any of my Linux PCs, and I've never heard anyone who uses a Mac ever mention it. That and 'forgetting' where the printer drivers have disappeared to overnight, amongst many other things.
-
Monday 29th July 2024 07:43 GMT A Non e-mouse
Re: "Windows Security best practices for integrating and managing security tools"
the whole design of Windows is fundamentally flawed with far too many hacks, kludges and compromises that have been left unaddressed, or fudged to 'fix' over the years, and which now are almost impossible to fix without a complete rewrite
To Microsoft, backwards compatibility is almost sacrosanct: this includes supporting software that did very dodgy things (e.g. abusing bugs).
To be fair, this isn't too far different to the policy for the Linux Kernel of "Don't break the user space".
-
Monday 29th July 2024 08:54 GMT Doctor Syntax
Re: "Windows Security best practices for integrating and managing security tools"
Backward compatibility is fine provided that what it's compatible with is good practice and documented as what's supposed to happen. That doesn't include things like use-after-free. It also doesn't include using a few features for their own applications but not documenting them for vendors of competing products.
-
-
Monday 29th July 2024 07:48 GMT abend0c4
Reboot of the machine to get it to work ...
I just unearthed my old Toshiba Libretto 50 which miraculously sprang to life running Windows 95. Merely to change the DNS server it needed to install software from the installation disks (or at least the copy on the hard disk, the installation disks being a couple of dozen floppies...) and reboot.
It's not that Windows hasn't changed (considerably) over the years, but the emphasis on backwards compatibility has led to an accumulation of deprecated features that still require (and complicate) support and maintenance. But that's a significant reason people continue to buy Windows - their software will continue to run. It's an unhappy compromise, but probably a commercial necessity.
-
-
-
-
Monday 29th July 2024 09:01 GMT Doctor Syntax
The instructions for fixing it implied the driver wasn't essential for booting into safe mode, but that wasn't possible without manual entry of the recovery key if BitLocker was used. It seems there's a gap there between safely fetching keys from a server and not opening up networking to a degree that would be unacceptable without services such as CrowdStrike.
-
-
Monday 29th July 2024 11:39 GMT big_D
That is how Windows already works and why most crash loops are recoverable.
But... The MDR tools are special drivers with an additional flag of "under no circumstances allow the PC to start without this being loaded". This is reserved for things like storage drivers, but also MDR tools; they have to be loaded first, in order to be able to detect whether malware is trying to get itself loaded. If the MDR is deactivated, or is loaded later, you might as well not have bothered paying all that money for the protection, because it is useless.
The only option is for Microsoft to ban everything from Ring 0, apart from its own code. But, the anti-malware software will probably have to be the last thing to be banned from Ring 0...
-
Monday 29th July 2024 14:52 GMT Doctor Syntax
Surely the instructions for recovery by booting into safe mode indicate that there is a state where either the driver wasn't loaded or, if it was, it didn't try to read the offending file. There's a halfway house of some sort. It will be a matter of risk management as to exactly what that state is now and what it might be in the future.
-
Monday 29th July 2024 19:49 GMT Adam Foxton
You're missing the point.
This should not have happened. CrowdStrike deliberately broke the safeguards MS had put in place to avoid this. They had a signed, thoroughly tested, Kernel-mode driver... that then ran unsigned, in this case untested, code. And didn't test that to make sure it rejected invalid data.
Regardless of the safeguards you put in place, someone who bypasses those safeguards can still screw up a system.
It's like the old saying- if you make things idiot-proof the Universe just makes a better idiot.
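[Ed: one way to close the gap described above - a signed driver consuming unsigned content - is to refuse to load any content file that doesn't verify against a signed manifest. A minimal integrity sketch using a detached SHA-256 digest; a real implementation would use asymmetric signatures and do the check inside the driver before parsing.]

```python
import hashlib

def verify_content(blob: bytes, expected_sha256: str) -> bool:
    """Refuse to load any content blob whose digest doesn't match the manifest."""
    return hashlib.sha256(blob).hexdigest() == expected_sha256

blob = b"channel-file-contents"
# In practice this digest would come from a cryptographically signed manifest,
# so tampering with the content file would be detected before it is parsed.
manifest_digest = hashlib.sha256(blob).hexdigest()
assert verify_content(blob, manifest_digest)
assert not verify_content(blob + b"tampered", manifest_digest)
```

Note that a digest only proves the content is what the vendor shipped - it says nothing about whether the vendor tested it, which was the other half of the failure here.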
-
Monday 29th July 2024 22:11 GMT Roland6
You are also missing the/a point.
Whilst the CrowdStrike event should not have happened, it does show that MS’s default BSoD and freeze isn’t good enough, particularly on servers.
The question is, given that the normal recovery action for a BSoD is to reboot and run tools like the system file checker, why MS hasn't already automated some of this.
I'm not talking about restoring to a fully live state, but to a state useful for recovery by remote management systems and operators, namely Safe Mode plus networking.
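[Ed: the automated fallback being asked for amounts to a persisted boot-failure counter - illustrative logic only, not how Windows actually implements its Automatic Repair.]

```python
def next_boot_mode(consecutive_failed_boots: int,
                   failure_threshold: int = 3) -> str:
    """Pick the next boot mode from a persisted count of failed boot attempts.

    After `failure_threshold` consecutive failures (e.g. crashes before the
    boot-complete marker is written), drop to a reduced environment that
    still brings up networking, so remote management tools can reach the box.
    """
    if consecutive_failed_boots >= failure_threshold:
        return "safe-mode-with-networking"
    return "normal"
```

The counter resets on any successful boot; the point is that after a crash loop the machine ends up remotely reachable instead of stuck cycling through BSoDs.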
-
Tuesday 30th July 2024 13:05 GMT Anonymous Coward
It wasn't a system file that was wrong, so that would have come back as everything being A-OK.
There is only one solution here, to treat Crowdstrike's claim of their system being WHQL certified when they were deliberately circumventing that as being fraudulent. This then allows them to be sued into oblivion. That would discourage people from trying this sort of idiotic crap again.
We just saw the same sort of thing with Dieselgate, where /technically/ it passed a test but in practice this wasn't the case. That cost VW €34Bn, and it hadn't even led to problems with airlines and hospitals and whatever else.
Remember, this wasn't an accident, or some accidental incompatibility. They made a deliberate choice to circumvent safety, security, and best practice while selling it as safe, secure, and stable. And while committing this fraudulent malfeasance they caused massive damage.
-
-
-
-
-
Monday 29th July 2024 12:08 GMT Howard Sway
Promises to discourage use of kernel drivers – so they don't crash the world again
That was a lot of waffle from the MS guy to basically say "please don't use some of these insecure features of our insecure OS from now on or you might break it again".
No recognition that it will ever be necessary to change the architecture and build in robustness to the point that it's reliable enough to cope with driver errors in a non-crashy way. The mindset seems to be stuck in the megahertz performance world of 25 years ago, rather than the gigahertz one we live in now, which means that the need for direct memory access through the kernel "for performance" isn't really a valid excuse any more: building in a couple of layers of extra protection could have provided enough isolation for safety with negligible performance impact - if you need to be running software like CrowdStrike in the first place (to protect against all the other vulnerabilities that Windows has).
-
Monday 29th July 2024 20:18 GMT Anonymous Coward
Re: Promises to discourage use of kernel drivers – so they don't crash the world again
This sort of thing has also crashed Linux systems - which the software needed to be installed on because Linux also has vulnerabilities.
When something is running in the kernel and breaks in an unhandled way, the computer halts. That's the only safe behaviour. What would you rather it do, just keep running privileged code whose state and content is unknown? Yeah, that sounds like a great way to stay stable and secure.
MS provides a service where it'll test and verify Kernel mode drivers to make sure they're safe. This was put in place decades ago specifically to avoid this sort of occurrence- the likes of Windows 98 got a terrible reputation for reliability in part because of really really bad drivers doing exactly this.
So MS started providing verification and certification, and all was good for a long time. Windows is way more reliable now than it was in the Win9x days and a lot of that is down to this change.
Crowdstrike deliberately bypassed this, getting a kernel driver signed so everyone assumes it's tested and safe but then making it reach out to run other, unchecked, code.
This sort of thing will- and should- crash any OS worth its salt.
And DMA is pretty important for malware protection so it can see what other processes are running in memory. If you have antimalware software that's running entirely as a regular application in user-mode and sandboxed off from the rest of the system then it's utterly useless.
-
Tuesday 30th July 2024 12:43 GMT Adair
Re: Promises to discourage use of kernel drivers – so they don't crash the world again
It isn't really the crash that's the problem. It's the fiasco afterwards—the total lack of robustness to recover safely in some kind of fallback mode after a failed boot.
Windows is basically an all or nothing 'testing OS' being touted and used as an OS suitable for robust and secure frontline usage, where livelihoods and lives are at stake. It isn't.
Not that any OS is bullet-proof, but Windows really does stand out as an OS that is wilfully playing out of its league, and people have been persuaded that that is okay and acceptable.
-
-
-
Monday 29th July 2024 14:23 GMT Anonymous Coward
Software megalith plans to work with anti-malware ecosystem
Why don't they fix their defective OS? To answer my own question: because it's unfixable. What's needed is a Manhattan Project that will come up with a number of hardware/software combinations that can't be compromised by opening an email attachment or clicking on a malicious weblink. Now that would be a true ecosystem.
-
Tuesday 30th July 2024 13:30 GMT Tridac
The only way to stop this happening again is to make it a requirement that all kernel-related updates be submitted to uSoft for testing and verification before inclusion in production systems. Otherwise, how can they ever guarantee that the system is robust? Of course, this would cost money, but in a zero trust environment, all third party vendors must be considered suspect until proven otherwise...