Design, we've heard of it
Why aren't these elementary questions like "what do we do if APNs are down" being asked at design stage?
Microsoft says last week's multi-factor authentication (MFA) partial outage, which hit its cloud-based services, was due to a dodgy network route between its servers and Apple's backend. According to a postmortem penned by the Azure team on Thursday this week, the whole thing kicked off at around 1330 UTC (0630 PDT) on Friday …
I'm guessing because, at design stage, they're not supposed to go down, so nobody bothered to build a case for testing that.
In truth, this cloud thingy is still pretty new and we're all learning the ropes. In a decade or two, when most of the bad situations have been encountered and resolved, we will then have a manual for proper design and rollout of a cloud infrastructure.
Right now I think we're still feeling our way.
Pascal, I'm not used to reading fair, even-handed, reasonable assessments in this comment section. You've thrown me off a bit ;)
They probably had 2 or 3 links that seemed resilient on paper, but real life has taught them they might need 5 or 10 <shrug> Fair play to them for being transparent about it.
First there was:
Writing to disk -> What happens if I suddenly don't have permission/it's full/it disappears.
Then there was:
Connecting to the database -> What happens if it's down or won't let me connect?
So really, this is not beyond the bounds of imagination:
Connecting to an online service -> What happens if it's down or won't let me connect?
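To make that last case concrete, here is a minimal sketch of the "what if it's down" branch. The names notify_with_fallback, primary and fallback are made up for illustration and have nothing to do with how Microsoft's MFA service is actually built; the only assumption is that a failed connection surfaces as an OSError (which covers ConnectionError and TimeoutError in Python).

```python
def notify_with_fallback(primary, fallback, message):
    """Send via the primary channel; if it is unreachable, use the fallback.

    `primary` and `fallback` are hypothetical callables that take the
    message and either return a result or raise OSError on failure.
    Returns a (channel_name, result) tuple so the caller can log which
    path was taken.
    """
    try:
        return ("primary", primary(message))
    except OSError:
        # Covers refused connections, resets and timeouts alike.
        return ("fallback", fallback(message))


if __name__ == "__main__":
    def push_down(message):
        # Stand-in for a push service that cannot be reached.
        raise ConnectionRefusedError("push service unreachable")

    def send_sms(message):
        # Stand-in for a second channel, e.g. a text message.
        return "sms:" + message

    print(notify_with_fallback(push_down, send_sms, "code 123"))
```

The point is only that the failure branch exists and gets exercised in testing, not that SMS is the right fallback.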
Not really, distributed computing has been around for a while. Deutsch et al’s eight fallacies of distributed computing were written in 1994-1997 and are all applicable to a cloud world - https://en.m.wikipedia.org/wiki/Fallacies_of_distributed_computing
“The network is reliable” is fallacy #1. Not considering failure cases properly for a service run by a different organisation at the other end of a WAN is pretty ridiculous. Experienced designers/architects don’t trust the network between two of their own services in neighbouring racks in a physical datacentre.
(Side note: Anyone running a service in a cloud should assume they’re going to eventually see some truly weird failure conditions given the multiple levels of compute and network virtualisation stacked atop each other and run on heterogeneous no-name hardware. If your application’s designed and monitored correctly, it shouldn’t matter that much.)
I can understand why problems connecting to APNs would cause problems messaging and hence authenticating iOS users. What is harder to understand is why a backlog formed and caused further problems. Keeping a long backlog of remote API requests (or doing unbounded retries etc.) which are irrelevant after a few tens of seconds because they are feeding into an interactive system is not a desirable property...
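As a rough illustration of the alternative, here is a sketch of a backlog that is bounded in size and throws away anything older than a time-to-live. The class name ExpiringBacklog, the 30-second TTL and the injectable clock are all assumptions for the example, not a description of the real system.

```python
import time
from collections import deque


class ExpiringBacklog:
    """Bounded queue that silently discards requests older than a TTL."""

    def __init__(self, max_items, ttl_seconds, clock=time.monotonic):
        # deque(maxlen=...) evicts the oldest entry when a new one arrives
        # and the queue is already full, so the backlog cannot grow forever.
        self._items = deque(maxlen=max_items)
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behaviour can be tested

    def put(self, request):
        self._items.append((self._clock(), request))

    def get(self):
        """Return the oldest request that is still fresh, else None."""
        now = self._clock()
        while self._items:
            enqueued_at, request = self._items.popleft()
            if now - enqueued_at <= self._ttl:
                return request
            # Stale: the user has long since given up, so drop it.
        return None

    def __len__(self):
        return len(self._items)


if __name__ == "__main__":
    backlog = ExpiringBacklog(max_items=1000, ttl_seconds=30)
    backlog.put("auth request for user A")
    print(backlog.get())
```

With something like this, a recovering service works through at most a few seconds' worth of still-relevant requests instead of a pile of prompts nobody is waiting for any more.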
Indeed. APNs has been around longer than most push notification services and has been quite robust. Apple are PITAs whenever you ask for IP address ranges for such things, and they just tell you either 17.0.0.0/8 or to use whichever domain it is. For good reason. A shame Microsoft can't get its act together...
I think that Microsoft ignored Cloud for too long. AWS had a 7-year head start before Microsoft realised Cloud was a direction they wanted to go. At that point they had to build a cloud platform super quickly and rushed to expand to keep up.
Case in point, Availability Zones are a new Azure thing and they only cover a subset of Azure services and are only available in some regions. Amazon had the luxury to design correctly from the start without competition, and all their services are built on Availability Zones.
AWS was 7x more reliable than Azure last year. I am sure Microsoft will catch up, but they will have hiccups along the way while they sort it out and fix the gaps they have.
Microsoft is pretty adept at ignoring trends. MSFT ignored the commercial Internet and spun out their own dubious and vague Microsoft Network that almost nobody used, only to rush to patch things up a lot later in the game, making a mess in the process. MSFT also ignored the clear and open standards for the web and birthed the most despised lineage of browsers ever. They also missed the start on mobile connected devices (even though their WinCE worked surprisingly well in many cases) only to shoot themselves in the foot with Windows 8/RT and enslavement of Nokia to the Evil Empire. They mocked the Open Source movement and achievements for decades only to embrace them towards the end of the second decade of this century...
They are changing their coats these years, but the nasty habits are still there. They have enough glamour and business prowess to attract some brilliant people from time to time, though (Sysinternals comes to mind first, but for sure there are many unsung heroes in the pot).
This sounds like an outage I had. We had 2 links to our hosted 'cloud' telephony system provider, one from each building, with a route between the buildings so that if one link dropped, all the traffic would route over the other. What we did not realize (as the provider said they monitored all this) was that both links were running at around 80% utilization, so when one dropped (thanks, British Gas, for digging up the cable) we started dropping calls all over the place in both buildings.
It took them 2 days to admit they had not noticed the high utilization and this was the cause, after making us go through every single switch and check the QoS was as it should be...
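The arithmetic behind that failure is simple enough to check automatically. Below is a small sketch; the function name worst_case_utilization and the 100-unit link capacity are assumptions for illustration, only the 80% figure comes from the story above. It asks: if any single link fails, how loaded are the survivors?

```python
def worst_case_utilization(loads, capacities):
    """Worst utilization of the surviving links after any single link fails.

    `loads` and `capacities` are per-link figures in the same unit
    (e.g. Mbps). All traffic is assumed to move onto the remaining links.
    A result above 1.0 means a single failure will drop traffic.
    """
    if len(loads) != len(capacities) or len(loads) < 2:
        raise ValueError("need matching lists describing at least two links")
    total_load = sum(loads)
    total_capacity = sum(capacities)
    # Try removing each link in turn and keep the worst outcome.
    return max(total_load / (total_capacity - c) for c in capacities)


if __name__ == "__main__":
    # Two equal links, each running at 80% of an assumed capacity of 100.
    ratio = worst_case_utilization([80, 80], [100, 100])
    print(f"after one failure the remaining link is at {ratio:.0%}")
```

Two links at 80% each means the survivor is asked to carry 160% of its capacity, so for a two-link setup each link needs to stay under 50% for the redundancy to be real.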
Basic network monitoring needed here.
SolarWinds
MRTG
ManageEngine
None of these are that expensive, and any of them would alert you to potential problems like this.
I once had a 1Gb link that constantly ran at around 99%; we got it upgraded to 10Gb, which then ran at 50%. Things were a bit quicker afterwards. We knew what it was doing thanks to the SolarWinds monitoring we were using.
Besides, management like this sort of thing because it alerts without needing someone sitting there constantly monitoring services.