Sounds like just another day in a data centre. Well done, Microsoft, for being honest and transparent.
Microsoft admits slim staff and broken automation contributed to Azure outage
Microsoft's preliminary analysis of an incident that took out its Australia East cloud region last week – and which appears also to have caused trouble for Oracle – attributes the incident in part to insufficient staff numbers on site, slowing recovery efforts. The software colossus has blamed the incident on "a utility power …
COMMENTS
-
-
Monday 4th September 2023 11:07 GMT Jellied Eel
Microsoft, heal thyself..
Storage hardware damaged by the data hall temperatures "required extensive troubleshooting" but Microsoft's diagnostic tools could not find relevant data because the storage servers were down.
Can they ever? I have a PC that bluescreens when waking from sleep. Rather than telling me anything useful, it gives me a QR code that probably takes me to a web page that probably doesn't tell me anything either. And might be on a server in the datacentre that's currently offline.
So no wonder the SOP for Microsoft diagnostics & troubleshooting is still to turn it off and on and hope for the best.
-
Wednesday 6th September 2023 09:13 GMT Pascal Monett
Borkzilla doesn't know how to manage sleep.
My SOP when getting a new computer is to modify the standard power config to forbid sleep and hibernation.
As far as I'm concerned, when I need that PC, it is supposed to be awake and ready to roll. If I don't need it, I shut it down. Especially these days when boot times are livable.
-
Wednesday 6th September 2023 09:15 GMT trindflo
Bluescreens
Windows bluescreens are always a driver problem, and most of the time it is a bad memory reference. Sometimes the driver requires bad hardware to flex its nethers, but if it happens on waking from sleep it probably doesn't need any help from the hardware and it just doesn't know how to wake up properly.
I use BlueScreenView to get an idea of what is happening (unless I wrote the driver, in which case Windbg gives a lot more information). BlueScreenView is a portable app (you don't need to install it). It will identify the problem driver. If you can live without the hardware, disable it in Device Manager.
-
-
Thursday 7th September 2023 07:24 GMT trindflo
Re: Bluescreens
You have a point that hardware can cause BSODs when the driver is written correctly: RAM as you mention, overclocking, failing to install the heatsink properly on the CPU, power supply. Those will have a random source.
If the trap is consistently in the same driver, it is the driver. I've not seen antivirus or anticheat bugs cause bluescreens without installing a companion driver - if something in ring 3 can cause a BSoD by using a callback incorrectly or sending bad data to the driver, I'd still say that is the fault of the driver.
-
-
-
-
Monday 4th September 2023 17:34 GMT martinusher
Everything is perfect, until it isn't
I've been caught by this. The problem is that when everything's running smoothly the bean counters start looking at you as a waste of money, head count and resources to be slimmed down. Since things don't break immediately their efforts produce immediate rewards (which they'll pocket a good bit of because they're so clever to notice where they can make those savings). Then something goes wrong.
At that point you'll discover that there really is surplus manpower. It will be meetings and committees convened to point the finger of blame at whoever is responsible.
The lesson learned from this is that you need just enough breakage / slippage in the work to ensure that everyone looks busy enough.
-
Monday 4th September 2023 18:08 GMT froggreatest
Re: Everything is perfect, until it isn't
Bean counters are unaware of the failures per se. This is all to do with the management who green-light the idea that it is fine to have so few people. I believe one of the lessons here is that there should be a minimum of x people per some y size datacenter.
-
-
-
Wednesday 6th September 2023 17:53 GMT DS999
Even if both spare chillers worked instead of just one they would have had an outage. The problem was the five chillers that wouldn't restart after a power sag. You can test power loss scenarios by flipping breakers, but power sags/surges are another matter.
I wonder if they rotate active/standby chillers regularly so that they all get their startup regularly tested and even out the wear? Perhaps that might have caught whatever caused the second standby chiller to not start. At least that would have bought them a bit more time before things got too hot.
-
-
-
Tuesday 5th September 2023 22:20 GMT Roo
The Cloud Vendors keep telling us that they offer a cheaper* option through economies of scale (and better utilization of hardware).
The flip side of this is that they have less redundant kit lying around to run a proper test, and if the test goes wrong (eg: cooling systems) then the scale of the outage is much bigger and the time & effort to recover can also be geometrically scaled accordingly. Also, because the resources are so readily interchangeable and by necessity interlinked, failures will cascade fairly readily as well. So the downside of a test going wrong is hugely expensive for them (and their customers), and arguably it's probably cheaper and easier to assume everything will be fine instead of causing outages deliberately.
Personally I like a bit of fat in the system and the ability to tightly contain failures, but the folks paying the dosh like zero fat, moving fast, breaking things, and failing big.
-
Wednesday 6th September 2023 11:26 GMT trindflo
Economies of scale
Economies of scale do amplify the disadvantages along with the advantages. I think the bigger issue is businesses that prioritize charging rents over providing services. A business that can get away with prioritizing profits to the point where the services fail has no effective competition.
"staffing of the team at night was insufficient [...] We have temporarily increased the team size"
That reeks of window dressing.
-
-
-
Tuesday 5th September 2023 18:42 GMT mikus
Well, who can blame them? As a small business, one has to make sure to right-size one's staffing accordingly. They can't afford to do it like those real "large" IT companies do it. Maybe those relying on this solution should shop for a more appropriately sized organization and product offering that can actually meet their needs for redundancy and scalability.
-
Wednesday 6th September 2023 09:17 GMT Pascal Monett
"We have temporarily increased the team size"
Temporarily.
So, the next time there's a "power sag", you'll be in exactly the same position again ?
That's manglement for you. Always use the minimum resources until you hit the wall, then boost like crazy until you feel safe enough to go back to minimum.
What could possibly go wrong ?
-