Holidays
Still, at least it is the summer holidays in the UK now, so lots of parents off to look after the kids - less urgency to sort their kit out.
IT administrators are struggling to deal with the ongoing fallout from the faulty CrowdStrike file update. One spoke to The Register to share what it is like at the coalface. Speaking on condition of anonymity, the administrator, who is responsible for a fleet of devices, many of which are used within warehouses, told us: "It …
Yes, but servers could/should have snapshots so you can revert, or admins may be able to get access for manual intervention (appreciate that's not always possible). Some servers will require care. This is where a company, you would hope, would be directing most of its resources right now.
EUC is a big problem, especially with WFH. In the old days, if this happened we could walk onto a floor of users, hit a few banks of desks at the same time and get through it a lot quicker. Can't do that over the phone, and with offices now utilising a lot of hot desks, you may go to 2 banks on Monday morning and find 20 non-working users, but on Tuesday it may be a scattering of 3 over the same 2 banks - so you are almost doing a one-to-one fix, not en masse.
As a lot of people will now be taking time off, there is going to be less stress for the IT teams, as quite a few of their users will be away compared to everyone being in, or coming in and demanding to be fixed NOW.
"as quite a few of their users will be away"
So IT support will be getting calls for a few weeks when WFH users return and find that their PC has been frozen while on vacation. The lucky users shut the thing off and will miss the FUBAR ClownStrike update altogether.
It could have been worse. In the old days of CRTs, users could return to a BSOD burned in to the phosphor.
I run a script, triggered by logoff, screen lock or RDP disconnect via a scheduled task, that changes the gateway so my system stays only on the local net when not in use.
When I log back on, another script puts the gateway back.
Anything I want my system to connect to while logged off is provided by adding the specific route to the script.
Maybe I was not so paranoid after all.
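For anyone wanting to copy the idea, this is roughly what the pair of scripts boils down to, sketched here with Linux ip route commands purely as an illustration (the poster's Windows version would do the same with route delete/add from the scheduled tasks); the gateway and allowed host addresses are placeholders:

# lock.sh - run on screen lock / logoff / RDP disconnect
ip route del default 2>/dev/null                 # drop the default gateway: nothing beyond the LAN is reachable
ip route add 203.0.113.10/32 via 192.168.1.1     # placeholder: one specific off-net host still allowed

# unlock.sh - run on logon / unlock
ip route del 203.0.113.10/32 2>/dev/null
ip route add default via 192.168.1.1             # restore normal routing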
Am seeing a lot of 'holier than thou' posts on Twitter from Mac and Linux users castigating anyone with a Windows PC and laughing.
This sort of thing can happen to anyone and it's only a matter of time before it happens to you, regardless of what flavor of OS you choose.
Stop being a d*ck.
Try running Checkpoint on Ubuntu. Periodic desktop freezes, and an unbootable machine after an update. Only fixable because a) I'm a dev and can get into GRUB to boot an old kernel, and b) the drive is encrypted with LUKS, so I just needed the normal unlock password to mount the root partition to enable the GRUB menu.
We have 1 client running crowdshit, their production systems have been down all day.
The client was extremely happy with their IT and had no complaints about any of their computers. They were forced to accept another IT team taking over due to corporate politics.
The other IT team forced us to remove the Microsoft antivirus and install crowdshit. Users immediately started complaining that their computers were constantly crashing and grinding to a halt. The other IT team confirmed that it was a known issue with crowdshit that they have with all of their systems. Their solution was "when your computer gets so slow it is unusable, reboot".
They literally will not give us the codes required to uninstall it, even temporarily for testing, because they know the operations manager will order us to "get that shit off of every computer" and we will happily comply.
The other IT team fucked up big time with this incident. They messaged the users directly, taking full responsibility for the issue and stating that they were working on a fix. Of course they forgot the part where they are meant to follow up with instructions for how to fix the computers. They also forgot to transfer some of the responsibility on to us by like telling us about it or something.
We are probably the only IT company who spent today looking at a list of offline computers and laughing our arses off!
I wish I could listen in on the "why are we STILL down???" phone call the tech-literate manager who knows the fix and what time it was announced will be making on Monday...
(same poster)
It's actually an in-house team replacing the outside vendor (us) that used to have "sole jurisdiction" over one part of the business...
The two-IT-teams issue is because the migration has been on hold halfway through, due to "capacity issues" limiting their ability to handle the increased support demands since Crowdstrike was installed on the machines. Their IT team is several times bigger than our company.
I have not personally been involved in that side of things; I only discovered today, when the subject came up, why they were citing capacity issues for a client that used to send 2-3 support tickets a week...
It affected just one of my clients as well. Their parent company had insisted CrowdStrike was installed. A little bit of serendipity helped them. They're only a small company and I'm a lone IT support engineer. So they don't have automated tools for updates/installs - the users had to install CrowdStrike manually. I've been on extended holidays for 3 weeks and only 1/3rd of them had installed it. If I'd not been off the grid, I'd have been hassling the others to install and therefore the impact would have been bigger.
Still, they're having to pay me for a call-out on Monday to rebuild one of the desktops, as they just can't get into recovery mode or remember the BitLocker key.
While I understand the sentiment of your comment - people shouldn’t gloat about this stuff - I’m afraid (actually I’m not) it is almost ALWAYS Windows that’s fucked up by this sort of thing. It’s NEVER any other OS (well it might be, but the percentage is so low it’s not even statistical noise)
The fact that anyone feels the need to run stuff like Cloudstrike just to keep the OS up and running is a very long-standing joke - there is something deeply wrong with an OS that needs this stuff to keep it going
I have a system where pacman said "hold my beer"..... Apparently it doesn't have a dependency tree for libraries? To avoid removing ones that other packages rely on. And pacman-static doesn't help when the server no longer finds the NIC, i.e. it knows the hardware is there but won't assign an interface to it.
Ah, the crux of the issue with Crowdstrike...why do you need it at all? Maybe some CxO in Mahogany Row fell for the marketing? Maybe some faceless "cybersecurity" auditor recommended it as best-of-breed? Maybe some insurance underwriter demanded it?
The way I see it, most of this garbage should already be part of the OS, not some add-on. BUT NOOOO...we have to "let the market determine what's best". If Microsoft hadn't been forced by legal decree to write in hooks for 3rd-party "security products" we'd have a smaller pool of idiots to blame for this kind of cock-up. A pool of one: Microsoft. Certainly, that'd sharpen up people's principles, knowing that trusting Windows means trusting Microsoft, and only Microsoft, with your crown jewels.
The way it stands now, people will throw whatever 3rd-party product promises to make up for the lack of Windows security into their infrastructure and sleep well at night knowing that all that money they spent doing so isn't going to Microsoft, but to a vendor that is "smarter" and "more agile". If that were the case, why aren't these vendors PUBLISHING safer, more secure OSes? See the problem?
I'm not advocating for the demise of Windows but simply pointing out the rather obvious fact that a Zero-Trust Architecture means you don't trust anything, INCLUDING ALL OF YOUR SOFTWARE AND SERVICES VENDORS.
I'm hoping that this is a wake-up call to those who think that writing a check means that they don't have to think about security anymore.
"The way I see it, most of this garbage should already be part of the OS, not some add-on."
You want the same people who wrote the 'Flaky OS' to also write the 'defence against the dark arts' software as well !!!???
They cannot write an OS without so many holes/issues/bugs *but* apparently *can* write the 'defence against the dark arts' software that protects those holes/issues/bugs.
If they were that good, why not 'fix' the OS in the first place !!!???
Going 3rd party is reasonable [Different set of eyes/minds etc] *BUT* do not trust any claims 100%, all software vendors lie.
Not because they are bad *but* that is the industry we have allowed to grow, we accept these lies everyday, pay good money for them & come back for more !!!
---------------------------------------------------------------------------------------
"I'm not advocating for the demise of Windows but simply pointing out the rather obvious fact that a Zero-Trust Architecture means you don't trust anything, INCLUDING ALL OF YOUR SOFTWARE AND SERVICES VENDORS."
Correct, you test for everything you can and have protection in depth.
i.e. Don't trust the claims of *anyone* and have backups/roll-back images/etc to allow you to recover from *failed* recoveries by the 'defence against the dark arts' software.
THIS SHOULD BE THE PRIMARY LESSON THAT HAS BEEN TAUGHT BY THIS FIASCO !!!
:)
That's a joke in itself. To get CE+ compliance you have to install the latest browser. When we had the scan done we failed initially because Google released a Chrome update the day before the test and we hadn't packaged the replacement up yet. We also had to remove IE from all the machines being scanned because their scan detected it was "out of date" despite the fact that MS had stopped updating it & were forcing everyone onto Chrome.
But we could pass because a) the testers let us choose which laptops to test and b) we could run the test as many times as we liked until it passed, and then give them the pass results only.
Luck not judgment is why other stuff is not affected.
Other operating systems also need security solutions, but there is this cult that believes all non-Windows operating systems are invulnerable. They are not, particularly as the attack vectors are increasingly moving to the applications.
Complacency just because this group thinks they are somehow invulnerable is what leads to catastrophe. Everyone needs to sharpen up on this and infosec teams need to start listening more to technical teams and not sales people or Gartner.
Infosec Teams are the cause of some crazy risks because all that matters is ticks in boxes.
The entire debacle should be a huge wake-up call (on top of the other recent attacks) to the tech sector. Sadly that is unlikely to happen; no lessons have been learnt from previous fiascos. That CrowdStrike are likely to escape without being sued out of existence is even more depressing.
Systemd: the reason I dumped Linux for anything serious. It offends every software engineering principle. If you look at the sources, it's a real mess of a trainwreck. Looked at one module, network.c, from memory, and it's a thousands-of-lines C module pulling in upwards of 100 header files. Absolute garbage, and big business depends on this? Originally from a Solaris and VMS environment, when an OS was written by engineers, for engineers. FreeBSD for several years now, and never a serious issue at all.
When you consider that crowdshite is an enterprise-class employee spyware program, it looks like karma finally caught up with them. Serves them right; hit hard in the pocket is the only thing they understand...
About the best I can say about Linux, is that a bad update generally only takes down one specific type of hardware or configuration at a time, and the rest are ok. So at least a bug in the 5.15 kernel's amdgpu driver only took down twenty machines last week, instead of the whole estate. Although I do deserve some of the blame for not testing every possible hardware combo.
phuzz,
Blame is exactly what you should accept !!!
Why the hell do you roll out something you do not test ???
You get away with this once/twice maybe more if you are *very* lucky ...
*BUT* you will be bitten eventually ...
at the worst time and it *is* ALL your fault.
Every IT techie I have ever worked with thinks they are the *BEST* and *infallible*.
I lack that ego and can make mistakes, like everyone else, *but* not because I think I cannot make them !!!
:)
I wasn't thinking I was "*BEST* and *infallible*", I was thinking that I've never seen a Linux kernel update (and this was a mainstream release, not a beta/canary/unstable release) fuck a computer hard enough to stop it booting. Turns out kernel devs are just as fallible as the rest of us.
Whereas I've seen it (very rarely) on RHEL6 boxes where an improperly installed kernel update prevented the machines from booting - booting into the previous kernel and reinstalling the new one resolved it 100% of the time but that machine is then down until someone intervenes.
This has nothing to do with brown envelopes. Linux fails despite being free because the lack of cost is not enough to offset the missing functionality in software that businesses need; they are therefore willing to buy software that saves them much more money than the purchase or license cost.
My home PC is a Mac and I'm not one of the annoying fanbois who insists that Macs are safe from everything including a direct hit from a nuclear missile because <insert spurious belief>. I'm well aware that I'm just one lazy click away from days of pain, so I just brought forward my quarterly "off-site" backup (it's in a plastic bag in a jam jar in the shed) and updated the reminder to do it monthly instead of quarterly. I've also made sure that the written copy of the disc encryption key is where I thought it should be and, just to be safe, done a separate backup of my password manager to an encrypted stick and hidden it somewhere. Tomorrow I'll be getting my fallback MBA out of the loft and getting the files up to date.
My shed backup has a Time Machine backup on it and an rsync copy of my files (photos, docs, music, etc). I have a number of other Time Machine backups running to Apple devices and a NAS, and also some stashed removables, but I run rsync copies as well. This comes from a hard lesson the first time I tried to restore a brand new Mac from Time Machine: the OSes were so far apart that the new Mac required the old Mac to be updated before it would continue. I couldn't do this because the old Mac wasn't compatible with the latest OS. I can understand that settings and apps from an older OS might not be compatible, but it wouldn't let me bring anything across. I cabled the Macs together and copied my files across, but if my old Mac had died completely I would probably have lost data, although I assume there would have been some way to access the Time Machine files. Since then my personal backup policy has been Time Machine dailies plus Time Machine and rsync monthlies to removable drives. I also keep the Mac OS up to date (I had a bee in my bonnet about something in the above situation and refused to update), although I always wait a week after it comes out.
I'm not slagging Time Machine off - it's easy to set up, reliable and invisible to the user once running and rolling files back to earlier states is easy and the UI is oddly satisfying.
I'm sure there's a pithy saying for this in IT circles, but if you've never tried to recover from a backup then you can't really be sure it's a backup.
I had this issue at work recently as the person most likely to solve computer problems. I actually did manage to convince an old Mac to go to a newer - not current, it’s about 12 years old - version and then restore everything. It required fully wiping and reformatting it, then updating the “to” Mac.
It was certainly harder than I wanted and even expected, given how easy TM is to set up.
I’ve never had to do the same to Windows; I don’t imagine that’s great fun either.
You can actually write your own timemachine in shell -- I did this in 1996.
Basically, it's just an ongoing series of datestamped folders, each with the entire filetree underneath them but hardlinks for unchanged files rather than copies.
You use find to walk your entire current filetree* and at each node look at the most recent datestamped copy. If it doesn't exist in the prior run, you use cp/mkdir to copy this node into the new filetree; if it's a file and exists but has changed, you copy it into the new filetree. If it exists AND hasn't changed, you use ln to create another catalogue entry for it in the new filetree.
That's it.
* (Put some filters on the find for handling OS "special files" if you're doing whole-machine replication.)
The nice thing about doing it manually is you can then create custom schedules for people with special needs. Eg, graphics/video artists on big projects wanting, say, 30min backups intraday, then 4hourly for yesterday, then daily for a week, then weekly for 2 months, then collapse to more-normal. This becomes a simple case of 30min TMs and a culling script to run daily.
I have vague recollections of rsync having an option to do this built-in, too. That is, hard-linking unchanged "copies" rather than re-creating the file as a new copy. So that could be worth looking at if you already have an rsync process set up.
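It does: --link-dest, which hard-links against files in a previous snapshot when they are unchanged. A minimal sketch of the whole scheme built on it (source, destination and the cull policy are placeholders to adjust to taste):

#!/bin/sh
# Datestamped snapshot folders, unchanged files hard-linked against the
# previous snapshot via rsync --link-dest.
SRC="$HOME/"
DEST="/backup/snapshots"
NOW=$(date +%Y-%m-%d_%H%M)
PREV=$(ls -1d "$DEST"/20* 2>/dev/null | tail -n 1)   # most recent snapshot, if any
mkdir -p "$DEST/$NOW"
rsync -a ${PREV:+--link-dest="$PREV"} "$SRC" "$DEST/$NOW/"
# Culling (keep half-hourlies for a day, dailies for a week, and so on) can be
# a separate cron job that simply rm -rf's the snapshots you no longer want.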
Been using ZFS right from the early Solaris 10 release. FreeBSD for several years now, which has ZFS and lightweight virtualisation (jails) out of the box and looks like it was modelled on the Solaris ideas. Always chose LTS versions for the initial install. Once a system is stable, with all the required packages etc, never update anything, but the machines are on a well-secured subnet, with any Windows rubbish on a separate subnet of its own. Separate hardware interface for each subnet. Never had a virus or successful attack in over two decades now. Would never even consider Windows for server work, or anything critical to the business. More trouble than it's worth...
That's the excuse Microsoft tries to sell you (every damn time), but then the failure rates would match up with the distribution between Operating Systems and it does not, not by a long shot.
The MTBC (Mean Time Between Cockups) of Microsoft products in general is way shorter than on other platforms, and it's easy to see why. If people keep buying it anyway, why bother? That's also why they could do away with testing - users now do it. The quality of their new products provides ample demonstration: 'new' Teams and Outlook were so bad they should not even have been released as a beta, now they don't care and sort of fix it on the fly.
Just like in science, progress will occur one dead body at a time. The MS crowd displaced the mainframers as the dominant IT admins. Most of them have never used or tried another OS. The younger admins I meet have grown up in a world of Linux, Android, macOS, iOS, and even the BSDs, where MS's dominance is only in business IT and some vertical markets. When their bosses retire one way or the other, things will change.
With a bit of luck InfoSec should now be looking at the decision and asking if it increases or decreases the risk.
Or even if they understand the risk-
"We can't boot into safe mode because our BitLocker keys are stored inside of a service that we can't login to because our AD is down."
Oh dear, how sad, never mind. This kind of failure should not be possible. Where are the keys? What happens if/when you can't access that server? Wouldn't it be a really good idea if those critical keys were somewhere where you could access them, if/when your AD has a bad day? Which has happened numerous times in the past.
Speak for yourself
For those of us who are boring, get on with IT, and aren't absorbed in the latest fads and AI willy-waving, designing around really bad scenarios is entirely what you do.
Plan for the worst, hope for something less intense to go wrong... But expect something to absolutely go wrong.
"We can't boot into safe mode because our BitLocker keys are stored inside of a service that we can't login to because our AD is down."
On my site I opened the (hard copy, in the safe) "oh shit" file to get the relevant local admin password, left my office, walked briskly down the corridor to the onsite server room and got one DC up in safe mode to do the fix. Then I changed that single-use local admin password before heading back to my office and updating the hard copy file with the new password before locking it away again.
Meanwhile the rest of my team were making use of the one DC I'd resurrected to get all the other impacted servers up into "safe mode with networking" now that they could talk to a DC, allowing them to login with their domain admin accounts AND access the bitlocker keys and perform the fix.
Once we had the AD infrastructure up and running the desktop support folks went into high gear busily fixing all the impacted workstations and laptops
A similar story played out on all my employer's sites worldwide and we had pretty much every server - even the non-critical ones - back online before noon UTC, and 99% of workstations and laptops fixed by mid-afternoon. None of which would have happened that fast without that hard copy file. Sometimes the best tech solution is decidedly low-tech :)
""We can't boot into safe mode because our BitLocker keys are stored inside of a service that we can't login to because our AD is down.""
I could not get my head around ^^^^
Something *so* critical is *NOT* to be stored inside something that can fail and render access impossible !!!
At a minimum you should *also* have copies on multiple media [not just usb sticks please !!!] that can be read by the simplest OS or even a piece of software like 'Winhex'/'dd'.
No extra encryption, just plain old physical security .... i.e. put it in a safe ... in multiple *SAFE* places. [Pun intended !!!]
:)
"Tell me you haven't got the first notion about the corporate world etc"
At every large company I've worked at, upper management with no IT knowledge have forced through "upgrades" that involve switching to windows.
You don't get cheaper alternatives with big PR firms smooshing the clients with fancy perks and gifts.
You mean Micro$hit. Because that's what it is, a shit excuse for an OS.
And the "modern" version keeps being enshittified, every release is worse than the last now. The latest "modern" Micro$hit "innovation" is putting AI spyware in every machine.
The ONLY reason the "corporate world" is still using Micro$hit at all is inertia and bribery. The only reason IT likes Windoze is because it's far more work to keep it from constantly falling over, it's a misguided attempt at job security. And these massive worldwide Micro$hit Windoze failures are the result, because corporate replaced people with 'management' garbage software.
Maybe if all those open source solutions provided something that actually competed, including commercial support, then it would make a difference.
The alternatives are simply too disorganised even after decades of trying. Commercial users want SLAs, contracts, support organisations and integrations.
It has very little to do with bribery. We use both Windows and Linux. Linux has to be a commercial distribution, which makes it the same as Microsoft. The product runs on both platforms, however there are still things that are far better on Windows.
Redhat pushed out an update a while back that broke grub and required manual intervention to fix any system that rebooted after applying the update.
And that was the OS vendor.
This wasn't even Microsoft, but a third party.
I've also had various other updates break services on Linux VMs, so no OS is immune to these things.
"We can't boot into safe mode because our BitLocker keys are stored inside of a service that we can't login to because our AD is down."
For as much as Crowdstrike have royally fucked up here, hopefully from Monday there will be a deep discussion and investigation in to why so many companies have been caught with their trousers down regarding disaster recovery.
But yet again, for all the techs having to deal with this, here's a pint.
We need a flying pig icon!
Speaking of pork. I've heard there are desperate recruiters offering £10k a day + expenses for anyone with a passport that can start now. Also possibly travelling by private jet given airlines seem to have been hit pretty hard. Kinda curious what risks that might have, ie which systems have been knocked out affecting airlines, so whether private pilots could file flight plans, manifests, or just end up caught in the same mess.
I took a look at FlightRadar yesterday afternoon. Traffic was a bit light but still reasonably busy. One thing that struck me when I looked was the track on one of the planes coming into Manchester. It had executed a peculiar loop around Hyde which is where they normally line up for the runway and a following plane had executed a loop a bit further back, neither in the usual holding locations. Clearly something had temporarily held things back. Whether or not it was Cloudstrike I don't know but I've not seen that one before.
All disaster recovery plans shall be reviewed and tested, all software releases/updates shall also be reviewed and tested, and all pigs shall be fed and ready to fly!
This also applies to ClownStrike in spades. Especially if any of their customers have managed to get testing and consequential losses written into their contracts. ClownStrike knows their users' environments cos it has its software on those systems monitoring this. So how ClownStrike managed to push an update without noticing this bug is something of a mystery, especially as it managed to infect and corrupt thousands of customers' systems.
Automated updates from a SaaS service like CrowdStrike cannot be tested in isolation.
Why not? You have a list of sacrificial test systems which are permitted to get the update on day 1. After it's complete they run a functional test suite to make sure that all is behaving as expected.
After that, a report is sent to the admin who can decide whether the automated update is provided to the production systems. If you're willing to take the risk you could make that second phase an "automatic unless the admin says no" option, although I wouldn't.
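In pseudo-shell the day-one gate is tiny; deploy_update.sh and smoke_test.sh below are hypothetical stand-ins for whatever your endpoint product and functional test suite actually provide:

#!/bin/sh
# Push the update to the sacrificial canaries, check them, and only then
# hand the decision to an admin for the production fleet.
CANARIES="canary01 canary02 canary03"
for host in $CANARIES; do
    ./deploy_update.sh "$host" || exit 1
done
sleep 600    # give the agents time to pick up the update
for host in $CANARIES; do
    ./smoke_test.sh "$host" || { echo "Canary $host failed - holding the update back" >&2; exit 1; }
done
echo "Canaries healthy - update queued for admin approval on production"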
> why so many companies have been caught with their trousers down regarding disaster recovery.
If the client is using Crowdstrike, it is Crowdstrike's responsibility to explain DR from a worst case, and make sure it's tested in the client environment.
Crowdstrike has caused an unrecoverable error- which is their responsibility to predict.
If their crappy .sys file causes a BSOD, how does the client machine recover?
Only the hardware should be capable of preventing recovery.
This is totally on them.
So that's like encrypting a key required for recovery with a copy of that key, and then deleting the copy? OK, yes, I was assuming a certain level of common sense competence from the client :)
My point is that it should be part of Crowdstrike's responsibility to the client to consider what happens if/when their .sys causes a kernel level exception.
You're right about it being Clowdstrike's responsibility to tell the client how there could be an issue with .sys files.
But on the admin side, if you have devices on your network that use Bitlocker you need to consider how you would get those keys. In a perfect storm scenario, you have to consider that while normally your laptops (as an example) can be unlocked using a key from the server, what if the server is in a burning building? What if it's been stolen? What if the TPM chip on a server* breaks? How do you get your key then?
I think for many, they either didn't consider the server being offline or assumed it would always be available.
*I don't know if a server would have a TPM chip, but I've had a laptop that was borked by a broken TPM chip and needing to unlock the Bitlocker on the drive.
Repeat after me, you do not bitlocker your MBAM server.
You have as many virtual Domain Controllers as needed, and always at least one physical (perhaps 2). You do not bitlocker this Domain Controller. This domain controller is behind YOUR lock and key and beyond the reach of any PFY.
You have 3 or 5 MSX servers but at least 2 consoles with the tools on them
Your CRL list must be accessible by HTTP and not just HTTPS
Do not store the backups on the same premises as the resource it secures
Untested restores are just well stored and secured entropy
Do not let your 2 most senior admins travel in the same vehicle
2 part forest administrator password should be in safe/bank vault, twice but together!
Murphy is not to be fecked with.
Most servers sold now have integrated TPM2.0 chips - you'd mostly use those at the "bare metal" OS level though. If you want to put a virtual TPM on a virtual machine (I.E. if you're running VMWare or Hyper-V or KVM or whatever on that physical server) you can do that without having a physical TPM - I've done it at home on my KVM rig and the "TPM" is just a virtual hardware device with the keys etc stored as files on the hard drive.
Yes, if the other two servers are running Linux and FreeBSD, or maybe two different Linux distros.
On FreeBSD it is super easy to get a zfs pool back online on completely different hardware. The software side of things took me about 10 minutes last time I did it. The hardware stuff took a bit longer. I would imagine Linux takes a similar amount of time if it is a distro you know inside out.
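For anyone who hasn't done it, the ZFS side of that really is about ten minutes' worth of commands; "tank" here is just a placeholder pool name:

# On the old box, if it's still alive enough:
zpool export tank          # cleanly release the pool

# On the replacement hardware, with the disks attached:
zpool import               # lists pools visible on the attached disks
zpool import tank          # add -f if the pool wasn't cleanly exported
zpool status tank          # sanity check before putting it back into service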
That kind of stupidity is *bad* !!!
Simply thinking through the downsides of such a design should have stopped this before it was implemented !!!
Trust nothing, Hardware, Software, Physical security etc etc
IT is not magic and flaws in design and thinking are the reason the cybercrime exists ... Try to think like a Cybercriminal and how you would break through your security ... if you cannot do this employ someone who can and will put *their* reputation on the line to back it up !!!
:)
You bought the Crowdstrike 'Kool-Aid' and cannot abdicate any responsibility !!!
No matter what a 3rd party promises, it is your responsibility to ensure it is real and works when the 'brown stuff' hits the fan !!!
This does present a possible *new* and valid scenario to plan and test for in the future .... if any good can come out of this total mess !!!
:)
From what I see the crass stupidity of Infosec teams has no limits.
They will follow a rabbit warren of security risks and mitigations to protect systems and data, to the point that the very thing they are using to protect something requires a key component of what is being protected in order to work.
How about using Azure KeyVault to encrypt all your backups in Azure in case you lose your Azure tenant? Or storing said backups in the same tenant?
I see it all the time, when you ask questions about how they can recover from this "oh, it is in the cloud, it is safe".
The same for AWS and Google keys.
@wolftone: Manglement do understand one thing. "Those who control the past, control the future". This current event will be swept from corporate memory, if it ever gets there; it will be redefined as IT admin failure at best, or forgotten, suppressed, distorted out of recognition. Nothing will change. An IT monoculture will be even more enforced to allow simplified, centralised control. Then the real outage will occur, for which this was the dry run according to my suspicions. Who benefits from seeing an outage happen on this scale?
The LAST person who should be blamed is any techie who pushed the big 'Go' button. There should be a zillion fail safes before it gets to that step, and it is the management/execs who are responsible for making sure those are in place. The engineer who pushes the release button (metaphorically) should not even have to know those upstream fail safes even exist. Those fail safes should be tested and audited and reported on. So one of three things was happening;
1. They had no fail safes and the engineers lied about this to management. However, management should have ensured proof of testing, etc. So the engineers either lied big time and falsified test results or the management took a "don't ask" approach.
2. Management knew testing was lacking or absent but ignored this
3. Management were so dumb that they didn't even think testing was required
So, in summary, unless we assume the front line engineers spent more time hiding fundamental failings of process than actually doing their real job (which I seriously doubt), the blame lies with management and the culture they have built at the company. Oh yeah, and you can include MS in that as they either knew about this or just plainly ignored something fundamental to their business and CUSTOMERS!!
The first person who WILL be blamed. Cynicism is bred from bitter experience
Clearly CrowdStrike believed that any update to a mere data file would be safe, and didn't bother to enforce any testing on them, perhaps believing that it was better to update them quickly to address new threats rather than delay their release due to testing. Personally I think this is a secondary problem compared to the apparent fact that they had never tested a corrupted data file against their system-critical kernel module.
For the kernel module to ingest a bad file and cause a BSOD, it would have to: a) not bother to fully validate the file before ingesting it, AND EITHER b) contain a memory-corruption or similar bug that causes a BSOD when processing a bad file, OR c) have error-handling so poor that when a bad file is encountered it BSODs instead of simply logging the issue and rejecting the file.
Yes, completely agree that we know who will be blamed and it won't be the execs!
As for testing, and the impact to getting updates out quickly... IMO (and experience), there is ABSOLUTELY NO EXCUSE for not having automated testing that validates deployment and basic function as a minimum, and this should not cause any meaningful delay to getting updates out. They clearly do not have such a thing or it is broken big time.
"perhaps believing that it was better to update them quickly to address new threats rather than delay their release due to testing"
And this file that was so urgently required as to have to be released without testing can, as a workaround, be simply deleted without waiting for a replacement.
My understanding is that as it's a routinely updated file, when the system rebooted the software would automatically check for a newer version, download the non-borked version, and the system would be fully functional again after a minute or so? Obviously it needs a more serious software update to fix the underlying bug and stop this happening again, but you can wait a few days for that (while testing it properly)
So I suspect the real problem here is how to contain kernel-level processes when they go rogue. This isn’t limited to Windows; I tried installing AMD video drivers on Ubuntu and it made such a mess I couldn’t reverse it and had to do a clean reinstall.
Yes CS should take some responsibility for not parsing the junk update and rolling back to a known good one while flagging the issue to central, but better minds than mine need to look at how to protect against this.
While we’re here.... This is the result of using Windows where it doesn’t belong. It’s a desktop and server OS, not a web kiosk and not an IoT thing. There are far better distros for that, with a small footprint and smaller attack surface (my favourite is Porteus Kiosk but others are available), but of course nobody wants to go there. Maybe they will now.
"That is not how CrowdStrike and similar solutions work."
But, now as a consequence *we* all know how Crowdstrike can be made *not* to work and how Crowdstrike will continue to work if you delete a *needed* *.sys file.
Crowdstrike had not considered that they could produce a flawed *.sys file, and the use/error handling of that file *obviously* does not play well with Windows !!!
Many changes are needed to address this !!!
:)
I am always wary of vendors with high-profile, high-cost sports sponsorship; all that profit comes from a product that could either have more spent on R&D or be significantly less expensive. Of course the C-suite who go to the F1, golf, soccer, Olympics or whatever then associate these brands (Darktrace is another...) with success, not understanding that (1) no EDR/MDR solution will make you 100% safe and (2) cost is not quality.
I remember the quip from a comedian (think it was Mike Harding) about seeing a photo in a magazine of a Durex-sponsored racing car with a puncture. He was laughing his head off but the local Aussies didn't get the joke. Turned out that Durex was the generic name for Sellotape in Australia!
Pretty certain it was Carrott - it goes something like: he sees the picture of an F1 car sponsored by Durex with a puncture on a trip down under, and finds the Australians don't find it funny (not a titter) because the brand is tape over there. Then he goes on to talk about an English person overhearing an Aussie asking for a roll of Durex, giant size - I think it finishes with him wanting to see an Aussie visiting a UK shop.
I'm sure some grossly overpaid "consultants" are already frantically trying to spin this as an argument against remote work. There are PowerPoint charts being drafted labeled "500% faster return-to-productivity after global Crowdstrike outage for in-person staff due to valuable centralized office attendance!" with stock image photos of men in suits sitting in a boardroom.
In reality it's lines of idle in-office workers crowding the IT department in a disordered line, each asking "is it fixed yet? Can we go home until it is?"
Is it too much to hope that when the dust settles legislators will start requiring that major infrastructure failures will, by statute, be followed up by an inquiry to determine what led up to the incident, what decisions were made which impacted the release of faulty S/W, and so on? Bad decisions, especially those undocumented or made to cut costs, would then lead to prosecution of those who made them.
“We can't boot into safe mode because our BitLocker keys are stored inside of a service that we can't login to because our AD is down”
Doesn’t this just sum up the ridiculous overly-complex intertwined mess that modern systems have turned into.
We used to have plain text files, and simple login procedures. Now we have 2FA and you need to talk to some remote server using some excruciatingly complex chain of certificates and crazy protocols just to power your local machine on!
I know some of this stuff is (in theory) useful but it’s just too fragile and outrageously complex
Linux user here but same sentiment: my mobile devices that are likely to get lost/stolen have encrypted disks, my rack-mount kit that is far less likely, and usually needs to reboot automatically, is not using such boot-level restrictions & encryption.
While MS & Crowdstrike are the obvious and justifiable whipping boys here on multiple levels, there is a major aspect of general resilience to be considered, independent of them: how to recover from an IT disaster of any sort (screw-up, attack, or just natural disaster). So many have a "hope it won't happen" plan.
I likewise had never previously heard of this CrowdStrike company (I don't sully myself with Windows stuff much), but, in that way that your brain tends to do word association when you hear or see a new name/word, the first thought that came to mind was "Is that sort of like flystrike?"
Where is it engraved in stone that outside companies can reach into your computer and silently alter its software without either telling you or asking permission?
The fundamental problem is an unstable operating environment that has to be repeatedly patched because of FUD -- it's the ideal self-sustaining ecosystem. If it was designed properly in the first place then there would be no need for constant updating. Sure, it would put a lot of people out of work but then, seriously, what are they doing that's productive?
It's worth noting that countries which can't or won't be served with constant updates due to sanctions and the like -- places like Russia and China -- seem to be unaffected by this problem.
"Where is it engraved in stone that outside companies can reach into your computer and silently alter its software". The conditions you agreed to when you chose to install the software.
Most AV software gives you the option to automatically install updates; you aren't forced to do so.
"Most AV software gives you the option to automatically install updates; you aren't forced to do so."
Sometimes you can be. Like it's considered 'best practice' to keep software patched and up to date. It's also often specified in customer bids that the bidder will do this. Then when you add costs for a test environment and staff costs, they say that's too expensive and want it taken out. Which is doable (under duress), if you insert clauses stating you are not liable for any loss or damages due to the lack of an adequate test environment. Which sales will then object to as being 'too negative'.
Then there's insurance. Hopefully everyone that's bid ClownStrike has checked their liability insurance. Especially corporate officers and their D&O policies. This is shaping up to be a very expensive mistake, and insurers don't like paying out. So there can be a bit of a Catch-22. Don't apply updates right away, get hacked, and there's no payout because the company failed to secure their systems. Apply the patch and stuff breaks, and there's no payout because the company should have tested it first. This may be Clownstrike's problem, because obviously they pushed an update without properly testing it first.
But it's moments like this that make me glad I'm mostly retired.
The elephant in the room is that this is a kernel mode driver - which is why the blast radius is huge and recovery options limited. Ironically, besides the risk of shipping code that crashes there is the risk of shipping a security vulnerability.
Apple has very sensibly killed off 3rd party kernel-mode drivers, and products for MacOS are presumably not implemented with them.
Had a vendor tell us today (unrelated to the CrowdStrike debacle) that they needed to come delete some soon-to-be-obsolete third-party software off our servers because the company that uses them like a marionette said they had to. Given that (1) we bought a perpetual license to that software, so we can run it as long as we like even if it won't have support in the future, and (2) the puppetmaster company is now offering replacement software that is notably inferior, not really fully ready for production use, and expensive, it's a wonder the sysadmin's response was printable. (It was a surprisingly polite "uh, no, I don't think so.")
So yes, sometimes the vendor tries to MAKE you do what's best for them rather than for you.
If one thing can take down the world, or a significant part of it, we've now discovered that its foundations were not as solid as they seemed.
Something is very, very wrong if a single, simple update can paralyse so many of the systems we rely upon.
Fixing this particular problem won't fix THE problem - there are vulnerabilities built into everything, and it doesn't need a hacker group to find them, it just needs the system builders to f*ck something up.
If nothing changes, something like this will happen again.
On two different Dell laptops today, we discovered we CAN'T boot to safe mode. Windows won't boot, so can't use that to initiate safe mode. (Terrible design decision there.) The Dell boot options simply don't give us the opportunity (just hardware diagnostics, which pass fine, regular boot, or full-recovery-mode to wipe and reinstall). Even turning off the machine during Windows startup, twice, doesn't work - instead of going to the Windows boot options, it goes back to the Dell ones.
Apparently the solution is to boot to some other media, then use that to delete the offending file.
Okay, let's see...
1) How is it, after the AWS S3 config fiasco, that companies are allowing circular dependencies in their system recovery process?
2) How is it, ever, that admins of large fleets of machines are allowing all-or-nothing updates from ANY source?
3) How is it that an AV provider, as a virtual admin of perhaps millions of machines, is pushing updates to all devices at the same time? I get that you don't want to advertise to the bad guys, but we'd still be safe if the rollout were smeared out over a couple of hours (see the sketch at the end of this post). Of course, this assumes that canaries are being used to stop a rollout if a significant number of machines are borking.
Given how AV/DLP has to work, and our lack of knowledge about exactly what went wrong, I'm hard pressed to fault anyone for pushing out an update that borks some percentage of machines. But the fact that so many machines were borked - that's really my question (for CS). At the same time, I have extremely little sympathy for someone who is managing 100k machines but hasn't bothered to wargame this scenario. I mean seriously, when was the last 12-month period that a Microsoft OS update didn't bork a lot of machines?
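A rough sketch of what that smeared-out rollout could look like, assuming a hypothetical push_update.sh / health_check.sh pair and a plain hosts.txt inventory; the numbers are arbitrary:

#!/bin/sh
# Shuffle the fleet, update it in waves, pause between waves, and halt the
# whole rollout if a wave produces too many casualties.
BATCH=50          # hosts per wave
PAUSE=1800        # 30 minutes between waves
MAX_BORKED=3      # abort threshold per wave
shuf hosts.txt | xargs -n "$BATCH" | while read -r wave; do
    borked=0
    for host in $wave; do
        ./push_update.sh "$host" || borked=$((borked + 1))
    done
    sleep "$PAUSE"
    for host in $wave; do
        ./health_check.sh "$host" || borked=$((borked + 1))
    done
    if [ "$borked" -ge "$MAX_BORKED" ]; then
        echo "Too many casualties in this wave - halting the rollout" >&2
        exit 1
    fi
done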
It seems that nobody in the wider media is apportioning a lot of blame on Microsoft.
They've been, obviously, very quiet about all of this. The media focus is all on CrowdStrike.
It should be noted that falcon hasn't impacted macOS or Linux users.
The corporate I work for uses Crowdstrike across all operating systems - my work issued mac has falcond running.
I can still use my mac to do my day job.
The fact that Windows BSODs due to a third party service failure is the real news story here.
Who the hell thinks it's a good idea to NOT bother to code in a failsafe scenario to cover a third party AV service provider failure?
It's coding 101 - or it should be.
One of the defining laws of software is Fail Gracefully.
A complete failure of the OS to boot and for the workaround to be manual intervention on a machine by machine basis?
That beggars belief.
Why are so few people mentioning this glaringly obvious issue?
Nope, that went away with Vista IIRC (definitely gone by W8). Now you have to wait for Windows to "notice" it's not booted properly (handled as part of UEFI boot) several times (default is 3 IIRC) or reboot holding down a key (either CTRL or shift, can't remember which) and it'll take you to the boot menu next time it boots up.
But, but, but my phone won't do anything when I press this and Ryanair customer service won't like talk to me and now they're saying they've got to do stuff on paper, but there's 2 queues.... I'm just trying to go on holiday. OMG end o' the world.
why does windows not have an immutable OS partition and a security partition for these things?
absolute pain
Finding the borked systems, attaching the boot medium, then getting to the CLI to run the commands. Repeat thousands of times.
Will management appreciate us more now they see the impact of not getting things done right, and the dedication of the teams correcting stuff?
Done my bit now, off to the weekend!!
"It is very disturbing that a single AV update can take down more machines than a global denial of service attack."
It's very disturbing that any IT department thinks it's a good idea to rely on mindlessly pushing out updates for a system which creates a single point of failure for large sections of the economy.
When I woke up to blearily hearing the radio saying "CrowdStrike" had taken out loads of IT systems, I naturally concluded that CloudStrike was today's DoS virus attack, and was primed to expect legions of IT engineers to delete, destroy, exterminate, eliminate CloudStrike.
Now it turns out to be an actual systems product. WTF names their software as though it is malicious software?
I must have been missing something; why would you use Bitlocker encryption on a server that is already (one would hope) physically secure? I encrypt the disks in my laptops with LUKS, since being laptops they might end up being left on the tube after a wet night out, and desktops because burglary etc, but I have never considered that encrypting my server disks might also be important. I've mainly regarded disk encryption to be valuable where unauthorised physical access might be an issue, but perhaps I need to think again? In my (admittedly limited) view the main threat to a server comes via its network connection, and possible vulnerabilities to network originated attacks in the software running on them - but from that end the disks will already be decrypted anyway, right?
I guess if you're worried about what the law might find on your servers in the event of a legal... situation... then perhaps it makes sense. But in that case won't you be required to hand over the keys anyway? I wouldn't know, because I'm not involved in any potentially criminal activity. Are they?
I must have been missing something; why would you use Bitlocker encryption on a server that is already (one would hope) physically secure?
Because so many large companies have been getting their asses handed to them by hackers that have broken into their systems. PCI and their insurance companies are beating them over the head for 'encryption at rest'. And BitLocker is the fastest, cheapest way to get there. Manglement doesn't understand that as soon as the OS boots, the data is unencrypted for anyone that can get logged into the system. To them, all their stored documents and free databases that don't support encryption are now safe.
Other threats mostly revolve around your storage media leaving site. Disk or tape failures* will mean that from time to time drives or cartridges will leave site and you'll have no idea where they could end up. Sure, you could have a policy and process that means crushing or shredding but if the media is encrypted, it is no longer something to worry about and it ticks that all important compliance item on the audit.
* A failed disk or tape isn't necessarily unreadable - it may well have just exceeded some read/write error threshold.
The BBC News bod has just said "This only affects people running Microsoft".
Microsoft what? Microsoft Word? Microsoft Teams? Microsoft Active Directory? Which Microsoft product?
It's like saying "this issue only affects Electrolux" AARGHH!!! So my fridge is going to destroy my laundry????
So many stupid people posting in the comments here. Yes Crowdstrike messed up BUT the real villain here is Microsoft. If their crappy OS can be toppled by any app running on the Windows platform then questions need to be asked. Yet people want to throw all the blame at Crowdstrike. Seriously cop on everyone!!
Exactly. Throughout yesterday I've had people saying I should stop blaming Microsoft, or Microsoft & Crowdstrike, because it's all Crowdstrike's fault.
To me it's both to blame:
* CS for having the ability to just apply an update automatically & seemingly without any QA.
* MS for not having that part of the OS being resilient to said updates.
I now know CS is available for Linux & MacOS but (for Linux) it uses existing functionality to hook into the kernel, so the dodgy update wouldn't cause the system to fail.
Note: I've never heard of CrowdStrike before yesterday morning (UK time).
I’m not defending MS - I look forward to the day when they burn in hell - but my understanding is that the crowdstrike software wedges itself into the boot process; it runs BEFORE Windows boots up. So it’s not really Windows' fault; the same mechanism would bugger up ANY OS.
It is very much MS’s fault for making an OS that is so crappy and full of holes that stuff like Crowdstrike is considered necessary in the first place though.
We were quite lucky as we don't immediately patch but wait a few days for others to test them.
There was an element of smugness today over the Crowdstrike strike not affecting us directly.
Unfortunately we send/receive stuff from other partners whose infra was affected. As a result, data ingestion fell off a cliff as they couldn't send to us.
After they recover, we'll be hit by a barrage of data in the catch-up.
Null pointer, per this chap's inspection of a stack trace dump.
So it tries to access part of a system driver and Windows throws its wobbly.
Since this hasn't happened before, this seems to imply that the "channel" file is not simple data but either code, or code-config used to build & run code @ runtime.
Any number of ways but completely extraneous to this: that's not what happened. The system driver wasn't running to generate the stack trace dump; the CrowdStrike code was running.
The CrowdStrike code sought to access memory in an OS-protected area; in this case, apparently, a system driver. The OS memory-protection kicked in and shut down the OS.
The system driver was not the trigger, was not "running" to cause the problem.
Analogy-rewrite of your question: "He attacked a shopkeeper and the police stepped in. Why did the shopkeeper DO this to us!?!" :D
Official instructions say to "reboot (again) into Advanced mode, and then choose Safe Mode"
Why bother, when there's a choice for Dos Prompt, RIGHT THERE. (And, if it's an old server you're working on with a KVM crash cart, save yourself anywhere from 4 to 7 min rebooting / checking memory, probing for the boot drive / starting at the UEFI/BIOS screen, before getting to the menu where you can choose "Safe Mode".)
Instead of all that, click on DOS Prompt, select Administrator, enter the password, and poof! You're at an X: prompt. (Well, as long as you're not using BitLocker.)
C: or D: or get to whichever partition has windows on it (can do a dir to make sure you're on the right one)
cd Windows\System32\drivers\CrowdStrike
del C-00000291*.sys
and, if there's no error message, you're done.
Type Exit, then click to start windows.
Can save up to 10 minutes per older server. And, I've not seen this, anywhere.
Guess Dos Prompts are too scary these days.
Well, all this has been a useful reminder to once more check my own disaster recovery plans (which basically involve booting off a USB stick running the (non-Microsoft) backup/restore software and restoring the OS partition, as all my data is on a non-C: partition), plus verifying I had disabled BitLocker (I had), plus telling my rellies to do the same, but also to check the stupid Win 10 removal of Safe Mode bootability, which, as it's not part of my DR plan, I hadn't realised was a potential issue. All whilst mindfully reminding myself that any trace of smugness often precedes a downfall.
Paranoid? Moi? Oui! It's the safest way.
Having had a little time to look at this (thankfully we aren't huge crowdstrike users and only had a few test machines with it on) it seems to me that Microsoft could have made this far, far less of an issue.
Why, when the same driver file is repeatedly causing a boot failure (and Windows clearly knows what's causing the failure, it's right there on the BSOD) does their 'automatic repair' process not simply block the driver from loading?
And where did the old 'last known good' boot option go? Was it perfect? No, but it's a hell of a lot better than rebuilding a whole OS or talking thousands of users through a less than intuitive recovery process.
I feel that while Microsoft aren't to blame for the outage they certainly could, and should, have made recovering from such an issue far easier.
My work laptop was affected by this as I tend to leave it on in standby. So when I got up yesterday it was in recovery and on rebooting it BSODd with csagent.sys as the cause.
Our IT department were out all over the place fixing servers and business critical systems yesterday, so any user support for this problem just wasn’t going to happen, rightly so.
Microsoft posted an interesting recommendation (I think it was on their azure blog or forum) to reboot “up to 15 times” which would eventually allow Crowdstrike to pull the fixed update. I was sceptical but tried it - after 8 reboots I got to the login screen and the network connected (which it hadn’t done previously).
Windows has a mechanism to skip loading drivers that prevent boot after a number of failures and this was seemingly triggered after 8 reboots.
Trying to login immediately resulted in the same BSOD (I assume the driver was loaded on login too). So I did the 8 reboots again, got to the login screen again with network connectivity and just left it there for half an hour. Then I logged in and all was well.
So the fix doesn’t necessarily involve hands on - any user can perform 8 reboots and leave it on the login screen for a bit for the fixed update to be pulled. Hopefully this helps someone!
I don’t understand why it isn’t more common knowledge or hasn’t been picked up by the tech media, although that might require some actual journalism instead of melodramatic pontification.
It wasn’t a coincidence that it suddenly got to the login screen and the network connected after 8 reboots when it had crashed before that consistently - then immediately crashing again at login when the driver loaded. It’s clear that Windows declined to load the driver at boot because of the boot failures. Nothing to do with a “race condition” between receiving the good update and csagent parsing the faulty file.
There's an important element here called trust. After this, who will reasonably be trusting CrowdStrike? This is going to cost companies a hell of a lot of money. Just take the group of travellers that will be claiming compensation for their delays, stays in hotels, restaurant bills etc. We're talking tens of billions in monetary damages. Whether anyone can trust them after this is a big question. Will they take chances and stick with them? Personally I wouldn't. Too risky. Having a fix will not be enough to recover from the disaster they are responsible for.
Those who still trust MS after many decades of evidence to the contrary?
Reports will be written, "lessons will be learned", and folks will go back to the same old shit again. Fundamentally there are several issues here, but the dependency on specific vendors will mean the cost & trouble of proper fixes is too much. MS know that, as do the AV suppliers.
Currently the priority is 110% marketing, not quality.
So they need to beef up their QA.
And as for the marketing: why don't they concentrate on actually good features? Deduplication (since Server 2012, working very well), SMB compression, robocopy /iorate, shadow copies (with better defaults for the client OS than now) and so on. But no, Paint gets layers and 3D, but it's so half-done no one actually uses it.
Imagine a world where a single software update is the butterfly that flaps its wings and causes a hurricane in the digital economy. Hospitals go on a coffee break, banks play hide and seek, airports take an unexpected flight, and businesses everywhere hit the 'pause' button. And in the midst of this chaos, someone suggests a digital currency? That's like bringing a virtual knife to a real gunfight. Long live the king – cold, hard cash!
Thesis: Crowdstrike is not worth 93 billion dollars (at time of writing).
Fear: ‘CrowdStrike is an enterprise-grade employee spying app masquerading as a cloud application observability dashboard.’
Quote from Microsoft: it "says the incident highlights how important it is for companies such as CrowdStrike to use quality control checks on updates before sending them out. It’s also a reminder of how important it is for all of us across the tech ecosystem to prioritize operating with safe deployment and disaster recovery using the mechanisms that exist." I needed a new keyboard after reading this.
Crowdstrike got popular cos it was sold to higher-level exec people who had no idea..
That multiplied; as with anything in IT, the more that have it, the more the higher-ups want it.
(proper IT Crowd fodder)
One borked update taking down so much stuff (a whole bunch of the 8.5m would be servers, which might be serving hundreds/thousands of people who aren't directly locally affected) is very much not good. But the specifics of removing it are a bag of wtf for IT admins.
How did this not get found during any stage of testing? As an IT person, I would hope something that is potentially going to affect your entire customer base (of lots of large key customers), should have the s**t tested out of it..
Re "Talking a warehouse operator through the intricacies of BitLocker recovery keys and command prompts is not for the faint-hearted!"
It's also not great having to give said warehouse operator access to an account with admin rights on the machine. That said, when the machine is back online and talking to whatever management system you have, you can change it.