
365, 364, 363...
-31415926535897932384626433832795 and falling...
Microsoft's Azure Active Directory (AAD) service broke down on Monday for at least some customers, thereby preventing affected Azure users from logging into and authenticating with the cloud giant's services. "Starting at approximately 1915 UTC on 15 Mar 2021, a subset of customers may experience issues authenticating into …
To quote from Only Fools and Horses:
Del: It's closed!
Trigger: (Checks watch) Well, it's a bit late, innit?
Del: What d'you mean 'a bit late'? You said it was open twenty-four hours a day.
Trigger: Yeah, but not at night!
If this is the model they are aiming for...
0.01 × 24 × 365 = 87.6 h
That's a lot of downtime.
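(For context, that 0.01 is a 99% availability figure; the sketch below just re-runs the same arithmetic for a few common SLA tiers, purely for comparison, and says nothing about what Azure AD actually promises.)

```python
# Rough downtime budget per year for a given availability figure.
# The 99% row reproduces the 0.01 * 24 * 365 = 87.6 h calculation above;
# the other tiers are illustrative only.
HOURS_PER_YEAR = 24 * 365  # ignoring leap years

for availability in (0.99, 0.999, 0.9999):
    downtime_h = (1 - availability) * HOURS_PER_YEAR
    print(f"{availability:.2%} uptime -> {downtime_h:.2f} h/year of allowed downtime")
```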
In comparison, our on-site AD controllers were unreachable to people in the company for zero hours last year. But hey, let's go to the Cloud, it's so wonderful, less expensive, more reliable... right?
Yes, our on prem AD was fully functional last night too... and completely useless for anyone not on prem.
How about your SSO / SaaS apps, cloud-only users etc, you know, the ones who don't have line of sight to your internal AD, or those with no on-prem identity? ADFS is a pain in the rear, so many companies use AAD for the external bit. Even in pass-through auth mode, if you *must* keep auth on-prem, last night's wobble borked that too. (And yes, you can go to another vendor for a similar experience... but it's not like Okta et al haven't had their own wobbles too.)
If so, why are you not renting out your spare AD capacity, if yours is cheaper and more reliable than the cloud providers'? Oh, sorry, have you got spare capacity? If not, how do you cope with usage spikes or DDoS attacks? Have you got multiple redundant comm lines/power lines across geographically dispersed sites? Transparent failover? Off-site backups?
You could be missing loads of income, but most likely you're being delusional, ignorant, lying to yourself, or lying to us. Hopefully it's just ignorance, at best.
I'm getting fed up with cloud haters who just don't understand the complexities and range of services that cloud providers offer, their own capabilities, or their own cost structure. They just look at their own pet server farm and their annual opex budget and say "gee, I'm cheaper", without even realizing what they DON'T have that comes as standard in the cloud.
That's not to say there aren't times when on-premises could be cheaper, or even the only option for legal/security reasons. I'm just saying that the only way to compete with cloud providers on cost, features, security and quality is to have the huge economies of scale they leverage.
And if not, please prove me wrong and become a cloud provider yourself.
Oh, redundancy, failovers... yeah, we've seen the back of those on Azure for quite some time now. (Microsoft) Cloud is being sold as a magically more reliable, cheaper, better solution than anything on-prem. So no-one should have any sympathy for the likes of Microsoft when their all-eggs-in-one-basket AAD falls over again.
Oh, sorry, have you got spare capacity?
Yes
If not, how do you cope with usage spikes or DDoS attacks?
We dimension our capacity accordingly. Our local ISPs filter DDoS attacks (yeah, we also have redundant ISPs).
Have you got multiple redundant comm lines/power lines across geographically dispersed sites?
Yes
Transparent fail over?
Yes
Off-site backups?
Yes, unlike OVH, for instance.
What is the cost of a few hours' downtime for a whole company? In terms of missed OTD? Unsatisfied customers? Pissed-off workers?
I don't give a fuck about becoming a cloud provider. I'm not targeting the world. I just do whatever is possible to satisfy my users, who don't have to rely on external services to do their job, and to make our customers very, very happy.
So you could be making yourself rich by providing a cloud service that is better and cheaper than the big players but choose not to?
Odd to see how the smarts needed to beat the hordes of highly paid, painfully recruited, very experienced, top-notch engineers at Google, Microsoft and Amazon on cost and quality don't translate into business acumen.
That is, assuming your provided evidence, which amounts to exactly nothing, is factually correct.
Please prove me wrong with facts; otherwise just join the herd, silently downvote, and go back to playing armchair soccer coach in front of the TV, or whatever else you do to fulfil your self-esteem after declaring yourself the smartest IT guy in the world. Which is as far as this discussion usually goes, and likely will in this case.
If so, why are you not renting out your spare AD capacity, if yours is cheaper and more reliable than the cloud providers'? Oh, sorry, have you got spare capacity? If not, how do you cope with usage spikes or DDoS attacks? Have you got multiple redundant comm lines/power lines across geographically dispersed sites? Transparent failover? Off-site backups?
It would appear that Microshaft don't either!
Many of the difficulties in creating cloud services are due to them being cloud services. If I'm fine with two little DCs, why would I care that running a mega-scale cloud is hard? Smaller systems have lower complexity, which makes them more manageable.
Why would MS care about SLAs? If you’ve bought into AAD, you can’t take your business elsewhere.
Years ago I went to Scotland with a couple of friends on a walking holiday. Mountain climbing in the day, football Euro championships in the evenings.
We were climbing Slioch in cloudy weather and reached a sort of small escarpment/ridge. I realised that, being more than 10 m high, it would show up on the map as an obvious contour line. Sure enough, there it was on the good old OS map*. Orienting myself, I stuck my left arm out and pointed, saying "The path should be somewhere over there."
There was a break in the cloud and I was pointing directly at the path to the summit. :o)
Ohh, hang on, you meant a different sort of cloud, didn't you?
As you were.
I'll get my coat, it's an all-weather anorak.
*Nothing beats knowing how to use a map and compass for navigating in cloud.
The gremlins started several hours before 19:15 UTC. From noon Eastern time in the US (17:00 UTC) onwards I was trying to figure out why access to an Azure Key Vault that worked yesterday didn't work today. When I finally (~5:00 PM) decided to look at the status in the Azure Portal, I found that the Portal said I had no subscription, hence no resources, even though it also said I was logged in. Not a good day.
Because when this happens it's not my problem.
If your AD server goes down you need to fix it. When the one in the cloud goes down you wait for someone else to fix it.
Compare this to how most people use a garage rather than maintain their own cars.
It's pretty easy to understand why people choose this.
And "The Cloud" has army's of on site 24x7 staff to fix it, access to the source code to find root cause and improve it. On site you're probably waiting for someone to drive in, find a problem, wait for a supplier patch...
"Starting at approximately 1915 UTC on 15 Mar 2021, a subset of customers may experience issues authenticating into Microsoft services, including Microsoft Teams, Office and/or Dynamics, Xbox Live, and the Azure Portal,"
A set with N elements has 2^N subsets if we include the null set and the set itself, so the above quote isn't really saying a lot.
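(If anyone wants to sanity-check the 2^N claim, here's a throwaway Python sketch with a made-up three-element set; the names are purely illustrative.)

```python
from itertools import chain, combinations

def subsets(items):
    """Every subset of `items`, from the empty set up to the full set itself."""
    return list(chain.from_iterable(combinations(items, r) for r in range(len(items) + 1)))

customers = ["A", "B", "C"]  # N = 3, so we expect 2**3 = 8 subsets
assert len(subsets(customers)) == 2 ** len(customers)
```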
I wonder why Microsoft does not become the first large computing supply company to actually give a very rough percentage of user accounts affected? Think of the kudos they would acquire through transparency.
"I wonder why Microsoft does not become the first large computing supply company to actually give a very rough percentage of user accounts affected? Think of the kudos they would acquire through transparency."
Transparency, Microsoft? You must be kidding.
There is a more serious issue with such transparency. If every customer is affected it is fairly easy to tell and will probably be public knowledge as per the Garmin outage last year for fitness data (reported on el Reg amongst other news media). If only some are affected then there may be client confidentiality issues with telling their competitors that some services are unavailable and therefore some organisations are at a serious disadvantage competing for work or just delivering it.
Luckily in NZ things seem to be back to normal.
I did pity the AWS teams who had federated logon to their AWS accounts via Azure AD, though! It meant they couldn't log in to AWS!
Perfect timing, as everyone was on Trade Me or watching the America's Cup!
I hope so. I retired a couple of years ago, but every other week I have a catch-up social call on MS Teams with my remaining friends back at work. I would not want to miss hearing about their trials and tribulations from the comfort of my flat.
(Although I do believe that some people actually use MS Teams for work purposes, but hey, it's a free country.)
AC coz, well, obvious really.
Preliminary Root Cause: The preliminary analysis of this incident shows that an error occurred in the rotation of keys used to support Azure AD’s use of OpenID, and other, Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key. Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end users were no longer able to access those applications.
"support a complex cross-cloud migration" sounds a lot like "we ignored our own key mgmt procedures in this instance to move a high paying customer from a competitor cloud to Azure accidentally hosing every other customer in the process."
Yes, well. I'm one of the 'can't say how many'. Not been able to log into Teams or Office 365 until about fifteen minutes ago.
' Signing in will give you a better experience'
Okay... sign in
' We're sorry but there's been an error. Restarting Teams/Office 365 now'
' Signing in will give you a better experience'
Okay... sign in
' We're sorry but there's been an error. Restarting Teams/Office 365 now'
until finally
'Your computer's trusted platform module has malfunctioned'
Well, up yours. Can someone tell me just what was wrong with a good old DVD installer? I don't need or use half the crap that comes with Office 365 and I REALLY don't need that damned 'Sign into Teams now' splash screen to keep coming up when my org. has already auto-signed me in.
Hey - left hand? What's the right hand up to today?