ARM, as Azure customers will know, is a deployment and management service...
Damn, and there was me thinking it was a CPU invented by Acorn back in the 1980s!
Any insomniacs, workaholics or those pulling an all-nighter on a past-deadline project may have noted a four-and-a-half-hour failure of Azure Resource Manager in Europe this morning, following a recent code change. Had this happened later in the day, admins would have been pulling their hair out in frustration over the …
"This issue was the result of an interaction between a recent code change in West Europe which introduced a subtle performance regression, and a specific internal Azure workload isn West Europe which exercised this performance regression in a manner which resulted in significant resource saturation."
Translation: one of our engineers tried to download the P*rnhub back catalog using a PowerShell script.
I think Sir Humphrey, having shuffled off this mortal coil, has been reborn and is writing preliminary outage reports. It's the type of jargon-filled writing he loved: a lot of words that say nothing.
The first time (the leap-year one), a code bug hit the entire worldwide service. Now this. Seems like testing needs to be beefed up.
Now, although I absolutely get what you are saying, ever so slightly in MS's defence (defense?) here: these systems are now so large and complex that it's arguable whether it's even possible to test this sort of thing prior to rolling it out.
Maybe that actually is the test: push out the updates and see if it's all still working. If yes, then breathe a sigh of relief and rinse and repeat next time; but if it all suddenly breaks catastrophically, then it's a case of 'oh bugger, right, let's roll it all back and try again later!'
One day, MS will push out an update which has an unexpected domino effect and brings the entire global M365 system crashing down. It (probably) won't be tomorrow, or next month, or next year, but one day it is inevitable. Now let's assume that MS have actually mitigated against this and are in a position to roll back any changes and restore order. Even so, there is a time lag: the update is pushed out, it takes a while for the effects to become obvious, and what initially looks like a local issue which can be dealt with grows to encompass whole regions and zones, until eventually someone senior at MS with enough clout makes the call: 'oh shit, roll it all back, now!'
And of course, this all takes time, so in the meantime, what is the cost to companies in potentially lost business?
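For what it's worth, that 'push it out, watch it, roll it back' loop is roughly what a staged (canary) rollout looks like in practice. Here's a minimal sketch in Python; deploy(), health_check() and rollback() are hypothetical stand-ins for illustration, not real Azure APIs, and the region list is made up:

```python
import time

# Hypothetical sketch of the "push it out, watch it, roll it back" loop
# described above. All function names and regions are illustrative, not
# real Azure APIs.

REGIONS = ["canary", "westeurope", "northeurope", "everywhere-else"]

def deploy(region, build):
    print(f"deploying {build} to {region}")

def rollback(region, build):
    print(f"rolling {region} back from {build}")

def health_check(region):
    # In real life: error rates, latency percentiles, resource saturation...
    return True

def staged_rollout(build, bake_time_s=5):  # hours or days in real life
    deployed = []
    for region in REGIONS:
        deploy(region, build)
        deployed.append(region)
        time.sleep(bake_time_s)  # let problems surface before widening the blast radius
        if not health_check(region):
            # The "oh shit, roll it all back, now!" moment.
            for r in reversed(deployed):
                rollback(r, build)
            return False
    return True

staged_rollout("some-build")
```

The time lag you mention is the bake time between stages: too short and problems don't surface before the blast radius widens, too long and you're shipping at a glacial pace.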
"Now although I do, absolutely get what you are saying, but every so slightly in MS's defence (defense?) here, these systems are now so large and complex that is arguable if it is even possible to test this sort of thing prior to rolling it out?"
Maybe, but even AWS manages better uptime than Azure, as does GCP (which appears to be the most reliable of the three). Which is no surprise, considering that Azure's software stack appears to be built entirely on sand, using toilet-roll cores and chewing gum. And it's not as if Microsoft's on-premises software has any better track record for reliability, so it's unlikely that "too complex to fix" is the issue here (especially considering that MS regularly fails at fixing problems in its offline software, too).
It's completely mind-boggling to think that any business would voluntarily base its critical infrastructure on Azure, especially considering the increasingly hefty ransoms Microsoft wants to be paid for the privilege of not caring about the uptime or security of its tenants. And yet here we are.