Microsoft admits slim staff and broken automation contributed to Azure outage

Microsoft's preliminary analysis of an incident that took out its Australia East cloud region last week – and which appears also to have caused trouble for Oracle – attributes the incident in part to insufficient staff numbers on site, slowing recovery efforts. The software colossus has blamed the incident on "a utility power …

  1. Blane Bramble

    Sounds like just another day in a data centre. Well done Microsoft for being honest and transparent.

    1. Steve Davies 3 Silver badge

      The words

      Microsoft

      Honest

      Transparent

      DO NOT BELONG in the same sentence together. MS is neither Honest nor Transparent, as we know from over 30 years of bitter experience.

      1. chip66

        Re: The words

        More like 40 years. MS, beginning with DOS, was all about stealing code and deceiving customers. Nothing has changed or will change with MS.

  2. Timto

    Nostalgia

    The internet was better in the 90s

    1. Potemkine! Silver badge

      Re: Nostalgia

      Euh.... I don't miss V.90.

  3. CowHorseFrog Silver badge

    If only Microsoft had some spare millions for support staff. Oh wait, they gave that to Nadella instead of actually manning data centers.

  4. breakfast Silver badge

    Unusual weather

    Unusual to have a single flash of lightning and then the cloud is completely gone.

    1. Anonymous Coward

      Re: Unusual weather

      Yeah, the cloud pisses on you before going away.

      Can we make MS go away? Please... pretty please...

  5. Jellied Eel Silver badge

    Microsoft, heal thyself..

    Storage hardware damaged by the data hall temperatures "required extensive troubleshooting" but Microsoft's diagnostic tools could not find relevant data because the storage servers were down.

    Can they ever? I have a PC that bluescreens when waking from sleep. Rather than telling me anything useful, it gives me a QR code that probably takes me to a web page that probably doesn't tell me anything either. And might be on a server in the datacentre that's currently offline.

    So no wonder the SOP for Microsoft diagnostics & troubleshooting is still turn it off and on and hope for the best.

    1. Pascal Monett Silver badge

      Borkzilla doesn't know how to manage sleep.

      My SOP when getting a new computer is to modify the standard power config to forbid sleep and hibernation.

      As far as I'm concerned, when I need that PC, it is supposed to be awake and ready to roll. If I don't need it, I shut it down. Especially these days when boot times are livable.

    2. trindflo Silver badge

      Bluescreens

      Windows bluescreens are always a driver problem, and most of the time it is a bad memory reference. Sometimes the driver requires bad hardware to flex its nethers, but if it happens on waking from sleep it probably doesn't need any help from the hardware and it just doesn't know how to wake up properly.

      I use BlueScreenView to get an idea of what is happening (unless I wrote the driver, in which case Windbg gives a lot more information). BlueScreenView is a portable app (you don't need to install it). It will identify the problem driver. If you can live without the hardware, disable it in Device Manager.

      1. phuzz Silver badge

        Re: Bluescreens

        BSoDs are usually bad drivers, when they're not bad hardware (dodgy RAM will crash any software), but badly written antivirus or anticheat can also be the problem.

        1. trindflo Silver badge

          Re: Bluescreens

          You have a point that hardware can cause BSODs when the driver is written correctly: RAM as you mention, overclocking, failing to install the heatsink properly on the CPU, power supply. Those will have a random source.

          If the trap is consistently in the same driver, it is the driver. I've not seen antivirus or anticheat bugs cause bluescreens without installing a companion driver - if something in ring 3 can cause a BSoD by using a callback incorrectly or sending bad data to the driver, I'd still say that is the fault of the driver.

  6. simonb_london

    What? It was an outage? I thought it was just the response time.

  7. martinusher Silver badge

    Everything is perfect, until it isn't

    I've been caught by this. The problem is that when everything's running smoothly the bean counters start looking at you as a waste of money, head count and resources to be slimmed down. Since things don't break immediately their efforts produce immediate rewards (which they'll pocket a good bit of because they're so clever to notice where they can make those savings). Then something goes wrong.

    At that point you'll discover that there really is surplus manpower. It will be meetings and committees convened to point the finger of blame at whoever is responsible.

    The lesson learned from this is that you need just enough breakage / slippage in the work to ensure that everyone looks busy enough.

    1. froggreatest

      Re: Everything is perfect, until it isn't

      Bean counters are unaware of the failures per se. This is all to do with the management who green-light running with only a few people. I believe one of the lessons here is that there should be a minimum of x people per some y size datacenter.

  8. Mike Pellatt

    Always

    Routinely

    Test

    Your

    Resilient

    Infrastructure

    to check it is actually, you know, resilient.

    1. DS999 Silver badge

      Even if both spare chillers worked instead of just one they would have had an outage. The problem was the five chillers that wouldn't restart after a power sag. You can test power loss scenarios by flipping breakers, but power sags/surges are another matter.

      I wonder if they rotate active/standby chillers regularly so that they all get their startup regularly tested and even out the wear? Perhaps that might have caught whatever caused the second standby chiller to not start. At least that would have bought them a bit more time before things got too hot.

  9. Phil Kingston

    So small numbers of staff for complex systems had horrible consequences?

    Might tell my boss who seems committed to resourcing by a simple "man hours-to-servers ratio", even when the systems are legacy, complex and mission-critical to some fairly big organisations.

    1. Pascal Monett Silver badge

      Nah. Wait for the failure that will educate him.

      If he's intelligent enough for that . . .

  10. deadlockvictim

    Problems

    Welcome to The Cloud, where our problems very much become your problems.

    We hope that you didn't lose too much in revenue because we sure as hell are not reimbursing you.

    You are the fool who signed up to The Cloud.

  11. hammarbtyp

    A.i

    I thought it was all run by chatgpt nowadays in this brave new world

  12. Claptrap314 Silver badge

    So...when was the last test of the redundant cooling systems, Microsoft? Hmm?

    Total failure of reliability engineering.

    1. Roo

      The Cloud Vendors keep telling us that they offer a cheaper* option through economies of scale (and better utilization of hardware).

      The flip side of this is that they have less redundant kit lying around to run a proper test, and if the test goes wrong (e.g. cooling systems) then the scale of the outage is much bigger and the time & effort to recover can also be geometrically scaled accordingly. Also, because the resources are so readily interchangeable and by necessity interlinked, failures will cascade fairly readily as well. So the downside of a test going wrong is hugely expensive for them (and their customers), and arguably it's probably cheaper and easier to assume everything will be fine instead of causing outages deliberately.

      Personally I like a bit of fat in the system and the ability to tightly contain failures, but the folks paying the dosh like zero fat, moving fast, breaking things, and failing big.

      1. trindflo Silver badge

        Economies of scale

        Economies of scale do amplify the disadvantages along with the advantages. I think the bigger issue is businesses that prioritize charging rents over providing services. A business that can get away with prioritizing profits to the point where the services fail has no effective competition.

        "staffing of the team at night was insufficient [...] We have temporarily increased the team size"

        That reeks of window dressing.

  13. mikus

    Well, who can blame them? As a small business, one has to make sure to right-size their staffing accordingly. They can't afford to do it like those real "large" IT companies do it. Maybe those relying on this solution should shop for a more appropriately sized organization and product offering that can actually meet their needs for redundancy and scalability.

  14. Anonymous Coward

    The Cloud

    Other people's computers you have no control over

  15. Pascal Monett Silver badge

    "We have temporarily increased the team size "

    Temporarily.

    So, the next time there's a "power sag", you'll be in exactly the same position again ?

    That's manglement for you. Always use the minimum resources until you hit the wall, then boost like crazy until you feel safe enough to go back to minimum.

    What could possibly go wrong ?

    1. Paul Hovnanian Silver badge

      Re: "We have temporarily increased the team size "

      "So, the next time there's a "power sag", you'll be in exactly the same position again ?"

      I'd like to see the person whose job it is to stare at the incoming utility voltage meter for an eight hour shift just in case it twitches.

  16. Knightlie

    Not much head-scratching needed

    "Microsoft also had trouble understanding why its storage infrastructure didn't come back online."

    For the same reason my Windows 10 machine restarts immediately when I hibernate it - your software really isn't very good.

  17. Eponymous Bastard

    "slim" staff?

    Are they fat-shaming now at Redmond?

  18. Sparkus

    whew...

    At least it wasn't expired certificates......
