Down we go
Microsoft 358 - and counting (down)!
Microsoft's cloudy email service, Exchange Online, decided to have an early night last night, and then enjoyed a lie-in this morning. Traditionally a night for fireworks, 5 November saw some sort of detonation within the Microsoft 365 infrastructure in the form of a borked update or, as the company delicately put it: "an issue …
[Overpaid M$ Bigwig 1]: We can't keep calling it 365 if we keep having these break-downs! We need to come up with another name. We've become the laughing-stock of IT World!
[Overpaid MS Bigwig 2]: This was the best Marketing could come up with. How 'bout we ask our World-Famous AI software to come up with a new name!
[World-Famous AI]: Office Sometimes
"Traditionally a night for fireworks, 5 November saw some sort of detonation..."
Well, I know it's just a bit of reporting fun, but to be pedantic, fireworks should only ever be undergoing deflagration rather than detonation. This is written into the Explosives Regulations 2014:
“pyrotechnic substance” means an explosive substance of a kind designed to produce an effect by heat, light, sound, gas or smoke, or a combination of any of these, as a result of non-detonative, self-sustaining, exothermic chemical reactions;
There has always been a bit of a question surrounding the speed of the shock wave in flash powder, but nobody wants to measure it too carefully in too many scenarios, just in case it turns out to sometimes be supersonic. This would result in fireworks being a lot less fun.
These outages confuse me. Are these just the "buy direct from Microsoft in the US" users?
Being a very small UK reseller of the UK hosted stuff my clients don't seem to suffer. I was directly working with one this morning in Outlook changing some settings around. So we would have noticed if it was down.
I don't pretend to know how the UK \ Europe hosting works, but does this imply UK & Europe have the sense to wait and ignore the Left Pondians doing their updates and let them test the changes first?
I don't want to defend MS on this stuff, but these reports are often a puzzle to me when I never hear complaints from my clients.
Ok, I've succumbed to the ubiquitous Smart Alec response, forgive me:-
Do they have your phone number as well as your email address?
===
On a serious note though, how does one plan for such an eventuality? With an on-prem mail server I have been able to resolve all catastrophes within four hours of notification (by telephonic or Short Message Service means) and copious proactive updates - via the same medium - to the afflicted customer. Incidentally without loss of messages. Flooding, burglary, internet outage, faulty hardware, ransomware, yep seen 'em all. DNS issues are more difficult because the failure modes are less predictable and, again that's because they rely on outsiders for their resolution.
Main vehicle for communicating MS365 failures would appear to be Twitter.
Not really "Smart Alec", it just shows more of this assumption of knowing everything without actually knowing any facts. Anyone can guess what may happen.
Seriously - I would never want to defend Microsoft, and only supply O365 because my clients have asked for it. I would love to pile in on the "it is always down" jokes but just fail to see it happening as often as reported with my small sub-set of clients.
It was a genuine question about the difference between the "buy from MS and be hosted in the US" and "Buy from the European based operations". It just seems like the European hosted stuff gets fewer problems. Personally I just put it down to different people running the actual servers. I would not want to credit MS with anything here.
As someone who has worked in environments where "Stay until it is fixed" is the expectation, I am not comfortable with acceding to such requests, even with my advice set out in writing. The risk analysis does not stack up favourably. I prefer to be in control of a situation such that I can sleep at night.
One argument I have heard a lot of is that "If O365 goes down then everyone else is in the same boat." Whatever happened to the USP that "we have resilience where others fail"?
Incidentally, if I am not mistaken O365 in decimal is 245.
I think the biggest issue is that when you have something running on your servers, even if the overall reliability is better than a service, manglement still persist in believing that "Cloud" is better. This is partly down to smart salespeople, being "on-trend" with the ridiculous Gartner advice and crucially, responsibility.
It does not matter how good your own Exchange may have been, these idiotic managers see a hosted service as a way of avoiding responsibility.
So, a service goes down and users are inconvenienced, these manglement idiots can with complete truth say "we are doing everything we can to fix it". In these cases, send an email to the appropriate account manager for the cloud service that has gone down. Full marks for effort, zero marks for effect. Board rooms get some spiel about SLAs and a few charts, everyone is happy except the real people that matter, the users who are trying to work.
Looking at sites such as Downdetector, I think the time zones help us in the UK: the outage of the 5th was reported as mostly hitting from 7pm, when most UK offices are shut for the night. If it took 12 hours to be resolved, that would take it to 7am, which is before we get into work.
Not defending the system, but that seems likely to be why we don't notice outages in the UK that much.
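For anyone who wants to check the arithmetic, here is a rough sketch of that overlap calculation. The 7pm start and 12 hour duration come from the comment above; the 9-to-5 office hours, the arbitrary year, and the "office five hours behind the UK" comparison are assumptions made purely for illustration, not measured data.

```python
from datetime import datetime, timedelta

# Arbitrary year (assumption); only the clock times matter here.
DAY = datetime(2000, 11, 5)
outage_start = DAY.replace(hour=19)               # 7pm UK time, per the comment
outage_end = outage_start + timedelta(hours=12)   # 7am the next morning


def overlap(start_a, end_a, start_b, end_b):
    """Length of the intersection of two intervals (zero if they do not meet)."""
    return max(min(end_a, end_b) - max(start_a, start_b), timedelta(0))


def office_hours_hit(start, end, open_hour=9, close_hour=17):
    """Total outage time falling inside assumed 9-to-5 office hours."""
    total = timedelta(0)
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day < end:
        total += overlap(start, end,
                         day.replace(hour=open_hour),
                         day.replace(hour=close_hour))
        day += timedelta(days=1)
    return total


# UK office: the whole outage falls outside working hours.
uk_impact = office_hours_hit(outage_start, outage_end)

# Hypothetical office five hours behind the UK: same outage, local clock times.
shift = timedelta(hours=5)
elsewhere_impact = office_hours_hit(outage_start - shift, outage_end - shift)

print("UK office hours affected:       ", uk_impact)
print("Five hours behind, hours affected:", elsewhere_impact)
```

Under those assumptions the UK office loses no working time at all, while the office five hours behind loses the last three hours of its afternoon.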
I'd recommend this video from 2015 - https://channel9.msdn.com/Events/Ignite/2015/BRK3186. 15 seconds in they say it's running on 65,000 servers (but that includes SharePoint I think).
Deployments are done in batches (about 54 minutes in), monitored for results and then continued. That means that problems won't always affect everybody, and may affect different people (if the order of the batches is rotated).
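A toy sketch of that idea, to show why a bad update only hits some people and why rotation changes who they are. This is not how Microsoft actually does it; the server names, batch size, health check, and both function names are made up for illustration.

```python
from typing import Callable, List, Sequence


def make_batches(servers: Sequence[str], batch_size: int,
                 rotation: int = 0) -> List[List[str]]:
    """Split servers into batches, rotating which server goes first."""
    if not servers:
        return []
    offset = rotation % len(servers)
    ordered = list(servers[offset:]) + list(servers[:offset])
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]


def staged_rollout(batches: List[List[str]],
                   is_healthy: Callable[[List[str]], bool]) -> List[str]:
    """Deploy batch by batch, halting at the first batch that looks unhealthy.

    Returns every server that received the update, failing batch included.
    """
    updated: List[str] = []
    for batch in batches:
        updated.extend(batch)        # "deploy" to this batch
        if not is_healthy(batch):    # monitor the result before continuing
            break                    # later batches never see the update
    return updated


servers = [f"srv{i}" for i in range(6)]   # hypothetical server names


def bad_update(batch: List[str]) -> bool:
    """Hypothetical health check for a borked update: always fails."""
    return False


this_time = staged_rollout(make_batches(servers, 2, rotation=0), bad_update)
next_time = staged_rollout(make_batches(servers, 2, rotation=2), bad_update)

print("Hit this time:", this_time)
print("Hit next time:", next_time)
```

With the order rotated, the same kind of failure lands on srv2 and srv3 instead of srv0 and srv1, and the other four servers are untouched either way. That would fit with some resellers never hearing a complaint.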
Thank you for the link.
Interesting that the system they've built is inspired by Kanban. I can understand the attraction, but can imagine that there would be times when the system becomes deluged with too much information, which I feel would be a big problem during outages.
The Red Alert concept seems to attribute visits to the System Status portal as indicative of a potential failure, and is used as a factor in trying to reduce MTTR (mean time to resolution). Maybe true, but a bit uneasy with the overall logic.
Resolution of hardware failure seems to be driven by SLAs dependent on that specific hardware, rather than the event of hardware failure being reported. It makes sense to replace hardware in batches for efficiency reasons, but interesting that SLA trumps all else to the extent that it was mentioned in the presentation.
Funniest moment was when he exclaimed "I think I saw an Edge browser in there!" looking through the browsers logged on to the service.
But, but... 'cloud' sounds so soft.. and calming... Besides, we all know that every cloud has a silver lining! How could something go wrong in such a perfect environment? We all should just trust the puffiness of clouds to bring us release from the drudgery of running these bloody server farms all day long!
Set your worries free! Send them to The Cloud!