Azure VMs ruined by CrowdStrike patchpocalypse? Microsoft has recovery tips

Did the CrowdStrike patchpocalypse knock your Azure VMs into a BSOD boot loop? If so, Microsoft has some tips to get them back online. It's believed that a bad channel file for CrowdStrike's endpoint security solution Falcon caused its Sensor active detection agent to attack its host. That's caused Windows machines around the …

  1. Noonoot

    Someone is going to get their ass kicked

    No comment

    1. heyrick Silver badge

      Re: Someone is going to get their ass kicked

      Everybody who authorised (*) having updates, unverified by their company's IT, self-installing on mission-critical systems?

      * - careful wording, as it's often the bean counters who authorise this rather than IT.

      1. Anonymous Coward
        Anonymous Coward

        Re: Someone is going to get their ass kicked

        Or because your Windows updates were migrated over to your MSP by the powers that be, and you lost all control. Despite asking for filtering, you get told "No, then we'd have to do it for all the companies we support", and when you ask "Well, do you at least delay the updates a few weeks to make sure they're tested?" you're told "Yes" - but that turns out to be a lie.

        Luckily we never got hit.

        1. LessWileyCoyote

          Re: Someone is going to get their ass kicked

          Whenever you're given a really, really stupid order, ask for written confirmation. Saved my behind when an order to put a one-line patch live without testing brought down a key government system for three days.

          (The patch was fine. It was the undocumented dicking around in unallocated memory that it unintentionally trampled that was the problem. But at least it found it).

      2. Denarius
        Unhappy

        Re: Someone is going to get their ass kicked

        Some minor flunky, maybe. But some failed senior manglement will be kicked upstairs, and one or two will get incredible golden parachutes for saving $5K and costing billions. This, after the El Reg story about $MS being called inadequate by, of all things, a US bureaucrat.

    2. teknopaul

      Re: Someone is going to get their ass kicked

      Or not.

      No such thing as corporate responsibility.

      "Restore from backup" implying no-one has any actual data in Microsoft PCs. Probably generally true. Keep your actual data on Linux file servers, db's, and the cloud which usually is some flavour of Unix + no you can't autoupdate me to death.

      Pity the folk with Window nodes in the cloud or sqlserver.

  2. Anonymous Coward
    Anonymous Coward

    Good luck booting into safe mode for VMs in Azure! A big shitty issue with VMs in Azure is the lack of a proper console: you're stuck with the serial console when things go titsup at bootup, which is pi$$ poor! On-prem with ESX, Hyper-V, AHV or whatever hypervisor you have, if you have a connection issue to a VM you just bring up the console.
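
    For what it's worth, a minimal sketch of getting at that serial console from the Azure CLI, assuming boot diagnostics are enabled on the VM and the serial-console CLI extension is installed (resource group and VM names are placeholders):

        # Boot diagnostics must be on for the serial console to work
        az vm boot-diagnostics enable --resource-group MyRG --name MyVM
        # Attach to the text-only serial console
        az serial-console connect --resource-group MyRG --name MyVM

    Still a far cry from a proper graphical console, mind.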

    1. Anonymous Coward
      Anonymous Coward

      Seriously why run a VM

      Unbelievable that in this day and age you want to run a VM in the cloud. What's next, running a z900 or midrange? Seriously, wake up: infra and sysadmins are not needed; we need more software engineers and data scientists, not button pushers.

      1. Anonymous Coward
        Anonymous Coward

        Re: Seriously why run a VM

        Because you can have multiple redundant hosts that pass the VMs' data between them, allowing failover to take place in a fraction of a second.

        1. cozappz

          Re: Seriously why run a VM

          There is a pod for that in Kubernetes.

    2. Unicornpiss
      Meh

      I'm guessing Microsoft has a proper VM console

      ..that is just not shared with Azure customers.

  3. Anonymous Coward
    Anonymous Coward

    I heard deleting the BIOS resolves the problem. If that fails, remove all physical drives, then submerge it in water; 30 minutes should do the trick. Only use fire as a last resort.

    I jest, of course, but I do love the old "turn it off and on again". The "up to 15 times" bit just makes it even crazier. I feel sorry for those having to deal with this right now. Who would have thought an update pushed out on a Friday (or just before) could cause so much chaos? When will humanity ever learn not to do this?

    1. Flywheel
      Facepalm

      I heard deleting the BIOS resolves the problem.

      Wow! I read that as "deleting the boss" - maybe that would be a better bet?

    2. Doctor Syntax Silver badge

      "The multiple and up to 15 times just makes it even crazier."

      It makes the disclaimer easier - if it hasn't righted itself yet then you just haven't tried enough times.

    3. richardcox13

      > The multiple and up to 15 times just makes it even crazier.

      But it does make sense. The theory is that the machine eventually gets enough time on the network before the next BSOD to pull down the latest files, which don't trigger the BSOD.

      I can also imagine this very much depends on the machine's internet connectivity being very low latency and very high bandwidth.

      1. Anonymous Coward
        Anonymous Coward

        Is this not in the cloud? Should that be an issue? If I were to guess, it would be something to do with the propagation of the update.

        1. richardcox13

          Depends.

          If the rebooting server has 0.1s in which to pull down the update before the bad previous update crashes it, then 10ms of extra latency may be significant. While a network round trip may be measured in microseconds within the DC, it will be milliseconds to another DC.

      2. Anonymous Coward
        Anonymous Coward

        Possibly shutting down cores so the system boots as a single core might help resolve the race. Once the system is fixed, restart with all processors.
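
        A hedged sketch of how that could be tried with the Windows boot configuration editor, assuming you can get to any command prompt at all; whether it actually wins the race is another matter:

            REM Limit the next boot to a single logical processor
            bcdedit /set {current} numproc 1
            REM Once recovered, remove the cap and reboot with all cores
            bcdedit /deletevalue {current} numproc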

    4. Wayland

      Submerge it in seawater.

    5. John Brown (no body) Silver badge

      "The multiple and up to 15 times just makes it even crazier."

      It's how you get them off the phone, and, if the wind's blowing in the right direction, it might even fix the issue. Asking them to do it 15 times in a row is a sign of desperation, where the extra few minutes of a reboot isn't going to buy you enough time. If you still need more time, tell them to start again, but 20 times this time :-)

  4. Omnipresent Silver badge

    Ya know?

    I really don't know why we are so concerned about the tinkering monkeys stupidly destroying themselves with technology. The world will carry on without them.

    1. John Savard

      Re: Ya know?

      I wasn't aware that anyone else but us humans was able to understand English and post on these forums. We humans are concerned about the survival of the human race because it's the one we belong to, and thus its survival is required for our own personal survival.

      1. heyrick Silver badge

        Re: Ya know?

        I'm old enough and cynical enough that I'd happily drink a toast to the end of the shitshow that passes for humanity and raise another glass to our new feline overlords.

        1. Khaptain Silver badge

          Re: Ya know?

          "I'm old enough and cynical enough that I'd happily drink a toast to the end of the shitshow that passes for humanity and raise another glass to our new feline overlords."

          You mean the Furries?

        2. TimMaher Silver badge
          Coat

          Re: feline overlords

          No, no! It’s not them. It’s the white mice.

          Mine's the one with the towel in the pocket.

          1. TDog

            Re: feline overlords

            +1 for the Hitchhiker's reference

      2. TheMaskedMan Silver badge

        Re: Ya know?

        "We humans are concerned about the survival of the human race because it's the one we belong to, and thus its survival is required for our own personal survival."

        True, but I can think of quite a few individuals that the race could easily manage without.

      3. John Brown (no body) Silver badge
        Joke

        Re: Ya know?

        "I wasn't aware that anyone else but us humans was able to understand English and post on these forums."

        And that's just how we like it. Carry on old chap, nothing to see here :-)

  5. glennsills

    Human intervention is needed.

    The problem here is that people who own or lease Windows systems, including VMs, are automatically installing updates that can bring those systems down. This automated updating saves tons of money in payroll, but a simple "let's try this release on a sandbox and see how things work before we roll it out to every system" would be prudent. If you are an airline, an airport, a healthcare system (NHS, really?), or any other operation that must run continuously, you shouldn't be trusting your vendors to be perfect. I understand that businesses want to save money, but until the businesses that own software take some responsibility for its maintenance, this kind of problem will continue to happen.

    And no, having AI check out the software is not a solution. :-)

    1. Anonymous Coward
      Anonymous Coward

      Re: Human intervention is needed.

      True, but in places like the NHS you have people in charge of IT who won't listen to their engineers. They appear to think they know best. Like the directors who took bribes from HP back in 2008 to go with HP laptops, or the managers who ignored my requests for an HDD crusher of our own, after which there was a big breach involving sold-on HDDs. It's always someone else's fault when things go wrong, but their idea when things go right, so they can get their bullshit promotion.

      Peter Principle comes to mind.

      1. PB90210 Silver badge

        Re: Human intervention is needed.

        The cream always rises to the top... but then so does scum!

    2. hoola Silver badge

      Re: Human intervention is needed.

      Yet on other topics people are screaming that systems are not updating. The trouble with AV, particularly a nebulous cloud-based pile of shite, is that you have no control. That is what a "modern" solution is sold on: always up to date, no need to interact with or manage anything.

    3. Khaptain Silver badge

      Re: Human intervention is needed.

      "Let's try this release on a sandbox and see how things work before we roll it out to every system" would be prudent."

      That can only work in some circumstances, definitely not all. And how long would you test for, 1 day, 2 day, 1 week, 1 month not all bugs show up immediately.

      It's simply not viable to test all platforms in all languages for all the software in your systems. In fact it's near to impossible. Most outfits simply don't have the material, the time or the need.

      1. teknopaul

        Re: Human intervention is needed.

        It's not up to "most outfits"; it's a megacorp pushing to everyone and not monitoring outcomes.

        This looks to me like: push stuff and don't test the result.

        Not humans but software that does that.

        I suspect CrowdStrike didn't think of testing at all, and don't have a mechanism to test the results of an update. I know of two similar botched updates by CrowdStrike. They struggle to fix them, they themselves don't know when it's happening, and they don't have a kill switch for a botched rollout that's ongoing.

        As we now all know: their internal testing before rollout is piss poor too.

      2. druck Silver badge

        Re: Human intervention is needed.

        Whilst more subtle problems may not show up for a few weeks, detecting a boot loop is pretty quick, so one machine in a sandbox would have caught this straight away.

    4. teknopaul

      Re: Human intervention is needed.

      Trying it out first does not seem to be an option for CrowdStrike. They knocked out all our laptops about two weeks ago: pushed an update that ate 100% CPU across the company, and somehow neither our IT nor Microsoft had the perms to kill the thing so we could work.

      Two days to fix. Then this.

      I.e. CrowdStrike issues are systemic. Move off ASAP if you can.

      1. druck Silver badge

        Re: Human intervention is needed.

        But it's not just Crowdstrike, think about the OS which makes this possible - "systemic" barely scratches the surface of that clusterfuck.

    5. teknopaul

      Re: Human intervention is needed.

      You don't need AI to check something is wrong. A recent CrowdStrike bork had crowdstrike.exe at 100% CPU.

      We deploy code all the time on our own systems and don't miss something obvious like that.

      Neither should a megacorp deploying to consumer hardware. It's basic QA. I suspect negligence here rather than the "one messy update" excuse they are peddling.

      Monitoring is not technically difficult to do with simple code, but while on the subject: AI can be better at pattern recognition and can find things humans might miss, e.g. a CPU spike 37 minutes after a deploy.

      CrowdStrike have not got as far as if (cpu >= 100).

      I suspect they don't have any mechanism to test rollouts, spending their money on cutting costs and scaling up rather than on quality.
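
      For illustration, a minimal sketch of the kind of post-rollout watchdog being described here, in Python with the psutil library. The process name, thresholds and timings are assumptions for the example, not CrowdStrike's actual tooling:

          import time
          import psutil

          PROCESS = "crowdstrike.exe"   # hypothetical name of the process to watch
          THRESHOLD = 90.0              # flag sustained CPU above this percentage
          SAMPLES = 5                   # consecutive hot samples before alerting

          def watch():
              hot = 0
              while True:
                  # Sum CPU usage across all processes with the watched name;
                  # cpu_percent() reports usage since the previous call.
                  usage = sum(p.cpu_percent() for p in psutil.process_iter(["name"])
                              if (p.info["name"] or "").lower() == PROCESS)
                  hot = hot + 1 if usage >= THRESHOLD else 0
                  if hot >= SAMPLES:
                      print(f"ALERT: {PROCESS} at {usage:.0f}% CPU - halt the rollout")
                      hot = 0
                  time.sleep(60)

          if __name__ == "__main__":
              watch()

      If a vendor ran even that against a canary ring before pushing to the world, a 100% CPU bork would be caught in minutes.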

  6. alcomatt

    I guess it detected windows, and blocked this malware. Product working as intended.... ;-)

    1. hoola Silver badge

      Looking at it more objectively, there are a couple of possibilities:

      1. Very little is deployed on Linux

      2. Simple luck that it was not affected.

      I am always stunned at how few people use AV on Linux servers or desktops, and on Macs. They are not invulnerable, and the attack vectors are moving to the software, not the OS.

  7. 502 bad gateway
    Pint

    Life imitates art

    Flippantly, I was quoting Roy earlier today: "Hello, IT. Have you tried turning it off and on again?"... How perverse that I was unintentionally providing official support advice.

    Thank Frick it's Friday, we can have a nice cool beer on the way home

  8. Anonymous Coward
    Anonymous Coward

    Tut tut ... letting your real feelings out !!!

    " ... That's caused Windows machines around the world to become even less useful"

    Meeeow !!!

    :)

    P.S. I wonder if the person responsible for the $12.5bn (as at market opening) slip-up is still working for CrowdStrike !!!???

    P.P.S. Good time to buy CrowdStrike cheap, as they will recover because there is no quick/easy/cheap alternative !!!

    Also a good time to get into coffee, pharma companies (particularly migraine meds) and keyboards (lots are going to be worn out !!!)

    If this market play pays off, please let me know !!! :)

  9. Doctor Syntax Silver badge

    "First, if you have a backup from before 1900 UTC yesterday, just restore that."

    And accept that you lost any work, orders taken, despatches made, whatever, since then.

    1. Anonymous Coward
      Anonymous Coward

      Depends whether you have anything on the box that changes state; for a large number of bog-standard application servers it probably doesn't matter.

      1. richardcox13

        Equally, if it is a database server, you then restore the most recent DB backup... and if your DB is not being backed up (could be incremental and/or log backups) every few minutes, your business data is very vulnerable to loss.
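
        As an illustration, the sort of frequent log backup being described, sketched in T-SQL for SQL Server; the database name, path and schedule are placeholders (in practice you'd run this as a SQL Agent job every few minutes):

            -- Append a transaction log backup to the backup set
            BACKUP LOG [SalesDB]
            TO DISK = N'D:\Backups\SalesDB_log.trn'
            WITH NOINIT;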

        1. Doctor Syntax Silver badge

          Yes. If you have incremental backups you'll be fine. But my point, which I maybe didn't make plain, was that just restoring a pre-1900hrs backup will, on its own, take you back to the state of work then. It will take further action to recover subsequent transactions, whether by restoring incremental backups or by manual re-entry. If, however, data was entered, say from a website, and there was no separate backup of that then it will have been lost. Is there a chance of recovering information from email acknowledgements or did the restore overwrite outbound emails? The system may even start handing out order numbers duplicating those issued between backup and outage.

          Restoring a backup image is only the start of getting back.
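
          To make the recovery half concrete, a hedged T-SQL sketch of rolling a database forward to just before the bad update landed (names, paths and the timestamp are placeholders):

              -- Restore the last full backup without recovering...
              RESTORE DATABASE [SalesDB]
                  FROM DISK = N'D:\Backups\SalesDB_full.bak' WITH NORECOVERY;
              -- ...then replay the log up to a point in time
              RESTORE LOG [SalesDB]
                  FROM DISK = N'D:\Backups\SalesDB_log.trn'
                  WITH STOPAT = N'2024-07-19 04:00:00', RECOVERY;

          Anything entered after the STOPAT time still has to be recovered by other means, which is rather the point.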

    2. Alister

      This is why you have a separate disk for the O/S, and don't allow any application to put any data there.

      We have successfully restored a number of Windows IIS and SQL servers today by just rolling back the O/S disk to last night's snapshot. The data disks were not replaced, and so they are still current.
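
      In Azure, a hedged sketch of that same OS-disk rollback with the CLI, assuming managed disks and a nightly snapshot (all names are placeholders):

          # Stop the VM, build a disk from last night's snapshot, swap it in
          az vm deallocate --resource-group MyRG --name MyVM
          az disk create --resource-group MyRG --name MyVM-osdisk-restored --source MyVM-osdisk-nightly-snap
          az vm update --resource-group MyRG --name MyVM --os-disk MyVM-osdisk-restored
          az vm start --resource-group MyRG --name MyVM

      The data disks are never touched, so they stay current.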

      1. LyingMan
        Pint

        Us too. Some of ours are in Azure as SaaS (managed instances of SQL Server and a bunch of small SQL DBs) and the rest are on-prem (application and database servers).

        The Azure ones survived and a few on-prem died. Yet to check why the former survived, but the dead ones were revived by restoring the OS disk. Worked a treat.

        Kudos to the consultant who did it that way.

  10. furrydingbat

    Going to make a great "Who, me?"

  11. vtcodger Silver badge

    A question or two.

    "First, if you have a backup from before 1900 UTC yesterday, just restore that. If your backup habits are lax, then you're going to have to repair the OS disk offline."

    Fortunately, I'm long since retired. And I quit using Windows a LONG time ago, after concluding that the system was far too buggy and poorly documented for serious use. In any case, the idea of automatically loaded updates has always seemed quite Utopian to me. I mean, what could possibly go wrong? Aside from supply chain attacks? And quality control problems in agencies you have no control over? And a huge exposure surface for sophisticated national agents to attack if (likely when, rather than if) international tensions boil over?

    An accident waiting to happen if you ask me.

    But, no matter. I do have a couple of questions about this particular ... ahem ... "situation".

    1. If your system, virtual or real, is stuck in a boot loop, how the heck do you load this here backup?

    2. Are you going to lose all the transactions entered after the last backup? Isn't that going to be a substantial problem for many businesses and organisations? After all, a lot of outfits purportedly use their computers to sell stuff, and/or buy stuff, and/or to keep track of things like attendance, work hours, medication lists, nuclear warhead inventories. Mundane stuff like that. Of course, if the computers are only there so the bosses can play Solitaire and send emails between important phone calls/meetings, maybe it doesn't matter all that much.

    1. Jou (Mxyzptlk) Silver badge

      Re: A question or two.

      > 1. If your system, virtual or real, is stuck in a boot loop, how the heck do you load this here backup?

      You say you quit M$ and use something else, and I am astonished to read this. Ever heard of booting off floppy disk/CD/USB/network and just restoring?

      > 2. Are you going to lose all the transactions entered after the last backup?

      Are the data partitions for your databases the same as the system partition? You don't even need a separate (v)disk; a separate partition is enough. Just a few years ago you had your drive A: for booting the OS and drive B: for your data. (What is a C:? Oh, that newfangled stuff, I read about it recently, won't be here for decades...)
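
      And for a machine stuck mid-boot-loop, the "repair the OS disk offline" route amounts to attaching the broken OS disk to a healthy rescue VM and deleting the bad channel file - a hedged sketch, with the drive letter a placeholder for wherever the disk lands:

          REM On the rescue VM, with the broken OS disk attached as F:
          del F:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys
          REM Detach the disk, reattach it to the original VM, and boot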

  12. TDog

    How many wake up calls do we need?

    And even more terrifyingly: how many have we got left?

    1. ecofeco Silver badge
      Windows

      Re: How many wake up calls do we need?

      History shows that none are needed because we didn't pay attention to the first 10,000.

      How many left is moot for the same reason.

  13. John Savard

    Safe Mode

    If it's hard to boot an Azure VM into Safe Mode, and this is what's needed to easily recover from this issue... then surely Microsoft should be rushing to add that feature to Azure?
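
    In the meantime, a hedged workaround, assuming you can reach any command prompt on the VM (for example via the serial console): force Safe Mode on the next boot, clean up, then put the boot entry back:

        REM Force minimal Safe Mode on the next boot
        bcdedit /set {default} safeboot minimal
        REM ...remove the offending file, then restore normal booting:
        bcdedit /deletevalue {default} safeboot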

    1. nonpc

      Re: Safe Mode

      I don't have Azure experience, but with VMware you just roll back to the previous snapshot.

  14. teknopaul

    On a serious note

    Is it safe to turn on Windows machines that have been down while this was going on?

    Does anyone at CrowdStrike work weekends?

  15. teknopaul

    BSOD

    Technically, how can the consequence of a botched Windows update be a BSOD on a running machine?

    Doesn't this require hot-installing new drivers that somehow manage to kill a 30-year-old microkernel? Shouldn't that be impossible by now?

    1. Jou (Mxyzptlk) Silver badge

      Re: BSOD

      Don't you remember? They weakened the separation with Windows NT 4.0, for performance reasons. They should have tweaked the disk caching mechanism, though. It is still as bad as it was back then; there are just some mitigations active so the cache does not take more RAM than is free (excluding the swap in that calculation).

    2. jotheberlock

      Re: BSOD

      Windows hasn't even pretended to be a microkernel in a long time, but as far as I understand it, you would only see the problem after installing the update and rebooting.

      1. Jou (Mxyzptlk) Silver badge

        Re: BSOD

        Windows NT did, up to version 3.51

  16. Anonymous Anti-ANC South African Coward Silver badge
    Joke

    Back through the Time Warp again...

    Time to do the time warp and go back to Windows NT 4 SP6... or Windows 2000 SP4...

    Never need updates, never will get any updates...

    /runs for them thar hills
