It wasn't just accessing the portals or Entra; any website behind Front Door was suffering outages as well.
I have a lot of unhappy clients today.
If you struggled to access the Azure Portal or Microsoft Entra this morning, you weren't alone – Microsoft has blamed a Kubernetes crash for the outage. The Windows giant noted problems from 0740 UTC, with multiple regions around the world reporting issues with its services. "Our monitoring detected a significant capacity …
Yes, for a service designed to give "higher availability, reduced latency, increased scalability", it's caused us many hours of grief today. We won't know how much business we've lost until EOD, but it's a real blow to our trust in MS.
It didn't help that it took them 4 hours to even report it on the Azure Service Health page, leading us to think we had networking issues until I spoke to our MS handler, who immediately told us it was a known issue.
Yeah it whacked our org website as well.
Also had issues with our Intune portal.
Was messy indeed. Web dev lobbing tickets at infrastructure team for two hours this morning. Infra team desperately trying to work out if it's us or a wider problem.
MS really need to sort out their status pages. It's infuriating troubleshooting something that they just haven't admitted to yet.
The fault took down their Front Door CDN services. I can understand using Kubernetes for the control planes, but why does AFD run on Kubernetes? A CDN needs to have low-level hardware access so that it can control TCP settings, optimise for hardware offloading, etc. The whole machine is dedicated to the job, so there isn't really a need for any containerisation. This fault should have been an inconvenience at worst - not allowing AFD policies to be updated. It shouldn't have taken down the entirety of Microsoft's CDN infrastructure.
This reminds me of the VMware PSOD with E1000 NICs, where the VM sent wonky commands to the kernel and caused the node to topple over. It got even better when the same problematic VM recovered onto the next node and caused that to pop too. One client ended up with their entire production environment dirty crashed because of a VPN concentrator. That was a fun explanation when it happened twice.
There was a similar issue with Broadcom FC cards. They had some sort of counter bug: once you had sent a certain amount of data, the card would crash, bringing the host down. The traffic load would fail on one card, the next would pick it up, then that would fail too, so the host would crash. The VMs would fail over and the cycle would start again.
Actually, I don't think it matters which bits of software were running on which systems. Microsoft contracts to manage all this for its paying users and pretends to ensure that it has the developers and testing programmes to make sure this is the case. It failed, should apologise to all, and offer some kind of recompense to those affected.
Is anyone keeping score of outages at the various gatekeepers? I seem to recall that Microsoft is handily out front…
Obviously a 2D chess player...
If an airline loses their IT systems and has to cancel all flights, they have a major PR problem. When every airline goes down at the same time because an AV vendor pushed an untested update, they just take a small hit to their short term profits. People don't even really blame them for the fact that they are still offline a week after everyone else has fully recovered.
This is why 3D chess players put their eggs in whatever basket everyone else is using.
If you struggled to access the Azure Portal or Microsoft Entra this morning, you weren't alone – Microsoft has blamed a Kubernetes crash for the outage.
That means they're running it on Linux. Good for them, but leave it to Microsoft to find a way to make even Linux unreliable.