Microsoft customers locked out of Teams, Office, Xbox, Dynamics – and Azure Active Directory breakdown blamed

Microsoft's Azure Active Directory (AAD) service broke down on Monday for at least some customers, thereby preventing affected Azure users from logging into and authenticating with the cloud giant's services. "Starting at approximately 1915 UTC on 15 Mar 2021, a subset of customers may experience issues authenticating into …

  1. Anonymous Coward

    365, 364, 363...

    -31415926535897932384626433832795 and falling...

    1. John Miles

      Re: 365, 364, 363...

      To paraphrase Douglas Adams

      We have a new normality. I repeat, we have a new normality. Anything that still doesn't work is therefore your own problem.

      1. TRT Silver badge

        Re: 365, 364, 363...

        I and my colleagues here are rapidly turning to penguin...

        1. msknight Silver badge

          Re: 365, 364, 363...

          Skwaak... skwaak skwaak, skwwwwark. Skwark skwark skwAAArk.

          1. TRT Silver badge

            Re: 365, 364, 363...

            Don't you feel that there's anything that you ought to be telling us?

        2. Anonymous Coward

          Re: 365, 364, 363...

          ... as in Linux users!

      2. Kane Silver badge
        Alien

        Re: 365, 364, 363...

        "We have a new normality. I repeat, we have a new normality. Anything that still doesn't work is therefore your own problem."

        Who's going to operate my digital watch now?

        1. TRT Silver badge

          Re: 365, 364, 363...

          I mean the odds of Azure AD going down like this must be...

          2^448,002,400:3 against

  2. Magani
    FAIL

    SANS

    French - without

    English - Software As a Non-Service

  3. Nate Amsden

    I guess they are going to miss their SLA?

    https://www.theregister.com/2021/01/06/four_nines_azure_active_directory_sla/

    1. Yet Another Anonymous coward Silver badge

      Re: I guess they are going to miss their SLA?

      Open 24hours (although not in a row)

      1. stuartnz

        Re: I guess they are going to miss their SLA?

        "Open 24hours (although not in a row)"

        Upvoted for the presumed Steven Wright allusion. :)

      2. Giles C Silver badge

        Re: I guess they are going to miss their SLA?

        To quote from only fools and horses

        Del:It's closed!

        Trigger: (Checks watch) Well, it's a bit late, innit?.

        Del: What d'you mean 'a bit late?' You said it was open twenty four hours a day.

        Trigger: Yeah , but not at night!

        If this the model they are aiming for...

    2. MatthewSt Silver badge

      Re: I guess they are going to miss their SLA?

      I thought the same thing... But that doesn't come into effect for another couple of weeks!

      Having said that, they still missed the 99.9% one anyway!

      1. Yet Another Anonymous coward Silver badge

        Re: I guess they are going to miss their SLA?

        >Having said that, they still missed the 99.9% one anyway!

        The new T&Cs say they can bring forward a future 10 years of assumed perfect uptime to make the 99.9%

    3. Potemkine! Silver badge
      Mushroom

      Re: I guess they are going to miss their SLA?

      0.01*24*365 = 87,6h

      That's a lot of downtime.

      In comparison, our on-site AD controllers were unreachable for the people in the company for 0 hours last year. But hey, let's go in the Cloud, it's so wonderful, less expensive, more reliable.... right?

      1. Anonymous Coward

        Re: I guess they are going to miss their SLA?

        Yes, our on prem AD was fully functional last night too... and completely useless for anyone not on prem.

        How about your SSO / SaaS apps, Cloud only users etc, you know, the ones who don't have line of sight to your internal AD or those with no on-prem identity? ADFS is a pain in the rear so many companies use AAD for the external bit. Even in pass thru auth mode, if you *must* keep auth on prem, last night's wobble borked that too. (And yes, you can go to another vendor for a similar experience... but it's not like Okta et al haven't had their own wobbles too)

      2. SecretSonOfHG

        Re: I guess they are going to miss their SLA?

        If so, why are you not renting your spare AD capacity if yours is cheaper and more reliable than the Cloud providers? Oh, sorry, have you got spare capacity? If not, how do you cope with usage spikes or DDOS attacks? Have you got multiple redundant comm lines/power lines across geographically dispersed sites? Transparent fail over? Off site backups?

        You could be missing loads of income, but most likely you're being delusional, ignorant, lying to yourself, or to us. Hopefully it is just ignorance, at best.

        I'm getting fed up with cloud haters that just don't understand the complexities and range of services that cloud providers offer, their own capabilities or their own cost structure. They just look at their own pet server farm, their annual opex budget and say "gee, I'm cheaper" without even realizing what they DON'T have that are standard cloud features.

        Not saying that there aren't times where on-premise could be cheaper, or just the only alternative for legal/security reasons. Just saying that the only way to compete on cost + features, security and quality with cloud providers is having the huge economies of scale that cloud providers leverage.

        And if not, please prove me wrong and become a cloud provider yourself.

        1. sgp Bronze badge

          Re: I guess they are going to miss their SLA?

          Oh redundancy, fail overs,... yeah we've seen the back of those on Azure for quite some time now. (Microsoft) Cloud is being sold as a magically more reliable, cheaper, better solution than anything on-prem. So no-one should have any sympathy for the likes of Microsoft when their all-eggs-in-a-basket AAD falls over again.

        2. Potemkine! Silver badge

          Re: I guess they are going to miss their SLA?

          Oh, sorry, have you got spare capacity?

          Yes

          If not, how do you cope with usage spikes or DDOS attacks?

          We dimension our capacities accordingly. Our local ISPs filter DDOS attacks (yeah, we have also redundant ISPs)

          Have you got multiple redundant comm lines/power lines across geographically dispersed sites?

          Yes

          Transparent fail over?

          Yes

          off site backups?

          Yes - not like OVH for instance.

          What is the cost of some hours downtime for a whole company? In terms of missed OTD? Unsatisfied customers? Pissed-off workers?

          I don't give a fuck about becoming a cloud provider. I'm not targeting the World. I just do whatever possible to satisfy my users who don't have to rely on external services to do their job and make our customers very, very happy.

          1. SecretSonOfHG

            Re: I guess they are going to miss their SLA?

            So you could be making yourself rich by providing a cloud service that is better and cheaper than the big players but choose not to?

            Odd to see how the smarts necessary to beat in cost and quality the hordes of highly paid, painfully recruited, very experienced, top notch engineers at Google, Microsoft and Amazon do not translate into business acumen.

            That is, assuming your provided evidence, which equates exactly to nothing, is factually correct.

            Please, prove me wrong with facts, otherwise just join the herd, silently downvote and go back to playing armchair soccer coach on your TV or whatever else you do to fulfil your self esteem after declaring yourself the smartest IT guy in the world. Which is the extent that this discussion usually goes to and likely will be in this case.

        3. SecretSonOfHG

          Re: I guess they are going to miss their SLA?

          Ahh, the smell of downvotes in the morning... come on, my ratio of up/downvotes is huge. While you're at it, take the time not only to downvote but also to refute any arguments. With facts, of course.

        4. very angry man

          Re: I guess they are going to miss their SLA?

          If so, why are you not renting your spare AD capacity if yours is cheaper and more reliable than the Cloud providers? Oh, sorry, have you got spare capacity? If not, how do you cope with usage spikes or DDOS attacks? Have you got multiple redundant comm lines/power lines across geographically dispersed sites? Transparent fail over? Off site backups?

          It would appear that microshaft don't either!

        5. P. Lee

          Re: I guess they are going to miss their SLA?

          Many of the difficulties in creating cloud services are due to them being cloud services. If I’m fine with two little DC’s, why would I care if running a mega-scale cloud is hard? Smaller systems have lower complexity, which is more manageable.

          Why would MS care about SLAs? If you’ve bought into AAD, you can’t take your business elsewhere.

      3. Eclectic Man Silver badge
        Coat

        Re: I guess they are going to miss their SLA?

        Years ago I went to Scotland with a couple of friends on a walking holiday. Mountain climbing in the day, football Euro championships in the evenings.

        We were climbing Slioch in the cloudy weather and reached a sort of small escarpment / ridge. I realised that, as it was more than 10m high, it would appear on the map as an obvious contour line. Sure enough there it was on the good old OS map*. Orienting myself I stuck my left arm out and pointed saying "The path should be somewhere over there."

        There was a break in the cloud and I was pointing directly at the path to the summit. :o)

        Ohh, hang on, you meant a different sort of cloud, didn't you?

        As you were.

        I'll get my coat, it's an all weather anorak.

        *Nothing beats knowing how to use a map and compass for navigating in cloud.

        1. Yet Another Anonymous coward Silver badge

          Re: I guess they are going to miss their SLA?

          >there would be an obvious contour line. Sure enough there it was

          That would be a good infrastructure project.

          Go around with one of those pitch white-line marking machines painting actual contour lines on Scottish mountains

      4. Potemkine! Silver badge
        Coat

        Re: I guess they are going to miss their SLA?

        Edit: I just missed the "%". the real value is 0.01 * 24 * 365 / 100 = 0,876h = 52 min 33,6 s.

        I told you I wasn't well awake this morning :sigh:

        So yeah, they obviously missed their target for 2021.
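
The corrected arithmetic above is easy to check with a few lines of Python. This is only a sketch: the function name `allowed_downtime` and the 365-day year are assumptions made for illustration, not anything taken from Microsoft's SLA terms.

```python
from datetime import timedelta

def allowed_downtime(availability_pct: float, days: int = 365) -> timedelta:
    """Return the downtime budget implied by an availability percentage."""
    # Convert the percentage to the fraction of time the service may be down.
    unavailable_fraction = (100.0 - availability_pct) / 100.0
    return timedelta(days=days) * unavailable_fraction

for pct in (99.9, 99.99):
    budget = allowed_downtime(pct)
    minutes = budget.total_seconds() / 60
    print(f"{pct}% over a year allows about {minutes:.1f} minutes of downtime")
```

For 99.99% this gives roughly 52.6 minutes a year, which matches the 52 min 33.6 s figure in the edit above; 99.9% allows roughly ten times that.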

        1. Yet Another Anonymous coward Silver badge

          Re: I guess they are going to miss their SLA?

          I wonder how many great business decisions have been made because of the difference between "0.01" and "0.01%" in a spreadsheet

  4. Version 1.0 Silver badge
    Happy

    A system restore can take a while but generally Microsoft system restores normally work well - it's the highlight in a set of operating systems that suck. We'll all be back to normal in a few hours with any luck.

    1. TRT Silver badge

      They get a lot of practice.

  5. cupplesey

    Well tomorrow is gonna suck.....

  6. TurtleBeach

    I Hope El Reg Stays on This

    The gremlins started several hours before 19:15 UTC. From noon Eastern time in US (17:00 UTC) onwards I was trying to figure out why access to an Azure Key Vault that worked yesterday didn't work today. When I finally (~5:00P) decided to look at the status in the Azure Portal, I find that the Portal says I have no subscription, hence no resources, even though it also says I am logged in. Not a good day.

    1. Raphael

      Re: I Hope El Reg Stays on This

      yes, it has caused minor panic when we suddenly could not find any of our resources

  7. Terry 6 Silver badge

    Tell me again why putting everything in the Cloud is such a brilliant idea.

    1. ecofeco Silver badge

      Because it's what all the cool kids are doing!

      1. Sanctimonious Prick
        Coat

        When?

        When would be a good time to ask; "So, after all the cloud outages, are you still using the cloud?"

        Mine's the one with the remote NAS in the left pocket, and the local NAS in the right pocket -->

    2. sabroni Silver badge
      Happy

      re: Tell me again why putting everything in the Cloud is such a brilliant idea.

      Because when this happens it's not my problem.

      If your AD server goes down you need to fix it. When the one in the cloud goes down you wait for someone else to fix it.

      Compare this to how most people use a garage rather than maintain their own cars.

      It's pretty easy to understand why people choose this.

      1. thondwe

        Re: re: Tell me again why putting everything in the Cloud is such a brilliant idea.

        And "The Cloud" has armies of on site 24x7 staff to fix it, access to the source code to find root cause and improve it. On site you're probably waiting for someone to drive in, find a problem, wait for a supplier patch...

        1. Santa from Exeter

          Re: re: Tell me again why putting everything in the Cloud is such a brilliant idea.

          @thondwe 'The Cloud' might have 24X7 staff, but at the rate that Microsoft fixes yet another Azure issue they caused it would be quicker to drive in and fix the on-prem stuff it replaced.

      2. Santa from Exeter

        Re: re: Tell me again why putting everything in the Cloud is such a brilliant idea.

        @sabroni. It might not be your problem but round here we actually give a shit about our users, so we share their problems when Azure fucks up yet again for our 24X7 staff

      3. Pascal Monett Silver badge

        Re: most people use a garage rather than maintain their own cars

        Most people use a garage because cars are increasingly made so most people can't maintain them.

        Bad analogy.

      4. hplasm
        Devil

        Re: re: Tell me again why putting everything in the Cloud is such a brilliant idea.

        "When the one in the cloud goes down you wait for someone else to fix it."

        True. But- the bloody phone never stops asking when *you* are going to fix it.

    3. Antron Argaiv Silver badge

      Single point of failure?

    4. J27 Silver badge

      It's great if your company can't afford a dedicated slice of datacenter. You get services you could never get if your company had to run the whole infrastructure itself. Otherwise it's stupid, once a company is big enough, it's just a liability and the advantages melt away.

    5. Anonymous Coward

      Think of it...

      .. as snow days for adults.

  8. HildyJ Silver badge
    Devil

    Maybe...

    They need "Google’s site reliability senseis ... to train [them] in their mystical ways."

    I love it when ElReg posts come together.

    1. J27 Silver badge

      Re: Maybe...

      I've used Azure, Google and AWS. They're all about the same level of reliability... which is pretty good, not perfect. You just pray to the tech gods that it doesn't fail during your peak usage hours.

  9. Tron

    Have they tried etc etc and on again.

    Put all of your eggs in someone else's basket, they said. It'll be fine, they said.

    1. TRT Silver badge

      Re: Have they tried etc etc and on again.

      Because when there’s a problem they’ll scramble every resource.

      1. Pascal Monett Silver badge
        Coffee/keyboard

        Good one

      2. David 132 Silver badge
        Coat

        Re: Have they tried etc etc and on again.

        ...working ova-time if necessary.

        1. TRT Silver badge

          Re: Have they tried etc etc and on again.

          It’s no yolk.

  10. anothercynic Silver badge

    Yep, it's affected us...

    But not a peep from our on-site business continuity peeps. I guess they can't send anything because Microsoft?

    1. Yet Another Anonymous coward Silver badge

      Re: Yep, it's affected us...

      They will have a teams meeting to decide how to send a message to everyone on teams to say when teams will be back up

  11. keithpeter Silver badge
    Windows

    Subsets...

    "Starting at approximately 1915 UTC on 15 Mar 2021, a subset of customers may experience issues authenticating into Microsoft services, including Microsoft Teams, Office and/or Dynamics, Xbox Live, and the Azure Portal,"

    A set with N elements has 2^N subsets if we include the null set and the set itself, so the above quote isn't really saying a lot.

    I wonder why Microsoft does not become the first large computing supply company to actually give a very rough percentage of user accounts affected? Think of the kudos they would acquire through transparency.

    1. Eclectic Man Silver badge

      Re: Subsets...

      "I wonder why Microsoft does not become the first large computing supply company to actually give a very rough percentage of user accounts affected? Think of the kudos they would acquire through transparency."

      Transparency, Microsoft? You must be kidding.

      There is a more serious issue with such transparency. If every customer is affected it is fairly easy to tell and will probably be public knowledge as per the Garmin outage last year for fitness data (reported on el Reg amongst other news media). If only some are affected then there may be client confidentiality issues with telling their competitors that some services are unavailable and therefore some organisations are at a serious disadvantage competing for work or just delivering it.

    2. Pascal Monett Silver badge

      Re: the above quote isn't really saying a lot

      Of course not, it's just another variation on the age-old "a small amount of customers have been impacted", for small < 100%.

      1. Eclectic Man Silver badge

        Re: the above quote isn't really saying a lot

        Compared to "only a small number of customers are affected" for some 'small number' < infinity.

        1. Yet Another Anonymous coward Silver badge

          Re: the above quote isn't really saying a lot

          Only a finite number of customers, less than INT_MAX, have been affected

  12. Tim99 Silver badge

    Not all bad?

    I have a mail account address that I give out to certain people expecting it to be spammed. Normally it gets about 20 spam messages (a rule sends almost all straight to junk) but yesterday it had just normal mail. This morning the spam is back, not a coincidence?

  13. disk iops

    whatever happened to 1-node, then 1-dc, then 1-site, then world roll-out strategy? What is this, "it compiled, ship it" level of testing and deployment?
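
The staged strategy the comment names can be sketched as a loop that stops at the first unhealthy stage. Everything here is hypothetical: the stage names come from the comment's wording, and `deploy` and `healthy` are placeholder callables standing in for whatever a real pipeline would do.

```python
from typing import Callable, Sequence

# Hypothetical stage names, taken from the comment's wording.
STAGES: Sequence[str] = ("1-node", "1-dc", "1-site", "world")

def staged_rollout(
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    stages: Sequence[str] = STAGES,
) -> list[str]:
    """Deploy stage by stage and stop at the first failed health check.

    Returns the stages that were deployed and passed their check.
    """
    completed: list[str] = []
    for stage in stages:
        deploy(stage)
        if not healthy(stage):
            # Halt here so the change never reaches the wider stages.
            break
        completed.append(stage)
    return completed

# Simulate a change that looks fine on one node but fails at data-centre scale.
result = staged_rollout(deploy=lambda s: None, healthy=lambda s: s != "1-dc")
print(result)
```

In this simulated run only the single-node stage completes, so the bad change never reaches a whole site or the world.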

  14. Tom 7 Silver badge

    When you own the whole system

    who are you going to blame?

  15. GreenJimll

    Graph API too

    Unsurprisingly this also affected Graph API calls into the Microsoft Cloud. Luckily my code is used to dealing with flakey Microsoft Graph API responses. It does get a lot of practice after all.
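
The comment does not say how that code copes, but one common approach to flaky remote calls is retrying with exponential backoff. The sketch below assumes that approach; `call_with_retries` and `TransientError` are hypothetical names, and no real Graph SDK call is used.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a 5xx or a timeout)."""

def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on TransientError with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last failure to the caller.
            # The delay doubles each attempt; jitter avoids synchronised retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the helper can be exercised without real waiting. Note that retries only help with short blips; during an outage lasting hours, as here, the caller still ends up with the final error.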

  16. Anonymous Coward

    Remember not to federate AWS to AzureAd!

    Luckily in NZ things seem to be back to normal.

    I did pity the AWS teams who had federated logon to their AWS accounts via AzureAD, though! It meant they couldn't log in to AWS!

    Perfect timing as everyone was on trademe or watching the America's Cup!

  17. Anonymous Coward

    On the mend!

    I hope so. I retired a couple of years ago, but every other week I have a catch-up social call on MS Teams with my remaining friends back at work. I would not want to miss hearing about their trials and tribulations from the comfort of my flat.

    (Although I do believe that some people actually use MS Teams for work purposes, but hey, it's a free country.)

    AC coz, well, obvious really.

  18. John70

    "for at least some customers"

    Understatement of the year

  19. maddoxx

    Kremlin?

    and I thought it is always the Kremlin.... and not the pixies

  20. Steve Davies 3 Silver badge
    Mushroom

    Cloud is great, brilliant, wonderful

    until it breaks

    sorta like this.

    Stay tuned for another instalment in the "Hey move to my cloud it is super reliable and brilliant and will always be there for you!"

    never-ending story.

    One of these days, it will all go [see icon]

    1. Stevie Silver badge

      Re: Cloud is great, brilliant, wonderful

      One big cloudy basket for your eggs.

  21. Stevie Silver badge

    Bah!

    I hate TEAMS with a passion.

    So it was supreme irony that yesterday I seemed to be the only person who could use the bloody thing sans nevereverspin or bizarro error messages.

  22. Someone Else Silver badge

    You mean, there's a backup process?

    If only it were that simple. Shortly after that, Microsoft warned, "The process to roll back the change is taking longer than expected. We'll provide an ETA as soon as one becomes available."

    "Dammit, I knew I put that backup around here somewhere...."

  23. Anonymous Coward

    Apparently Forms is now fixed but many of my users are still unable to access a subset of them.

  24. Anonymous Coward

    Preliminary Root Cause

    Preliminary Root Cause: The preliminary analysis of this incident shows that an error occurred in the rotation of keys used to support Azure AD’s use of OpenID, and other, Identity standard protocols for cryptographic signing operations. As part of standard security hygiene, an automated system, on a time-based schedule, removes keys that are no longer in use. Over the last few weeks, a particular key was marked as “retain” for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that “retain” state, leading it to remove that particular key. Metadata about the signing keys is published by Azure AD to a global location in line with Internet Identity standard protocols. Once the public metadata was changed at 19:00 UTC, applications using these protocols with Azure AD began to pick up the new metadata and stopped trusting tokens/assertions signed with the key that was removed. At that point, end users were no longer able to access those applications.
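
The failure mode described in this root cause can be illustrated with a toy model of the cleanup job. This is a sketch under stated assumptions, not Microsoft's implementation: the `SigningKey` record, its field names, and both functions are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SigningKey:
    """Hypothetical record for one signing key; the field names are assumptions."""
    key_id: str
    in_use: bool
    retain: bool = False

def keys_to_publish(keys: list[SigningKey]) -> list[SigningKey]:
    """Return the keys that should stay in the published metadata.

    A key is removed only if it is neither in use nor marked "retain".
    The bug in the root cause corresponds to dropping the `retain` check.
    """
    return [k for k in keys if k.in_use or k.retain]

def token_is_trusted(token_key_id: str, published: list[SigningKey]) -> bool:
    """Relying parties trust a token only if its signing key is still published."""
    return any(k.key_id == token_key_id for k in published)

keys = [
    SigningKey("current", in_use=True),
    SigningKey("migration", in_use=False, retain=True),
    SigningKey("expired", in_use=False),
]
published = keys_to_publish(keys)
print([k.key_id for k in published])
```

With the `retain` check in place the "migration" key stays published and its tokens remain trusted. Remove that check and the key disappears from the metadata, after which every application that refreshes the metadata rejects tokens signed with it.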

    "support a complex cross-cloud migration" sounds a lot like "we ignored our own key mgmt procedures in this instance to move a high paying customer from a competitor cloud to Azure accidentally hosing every other customer in the process."

    1. Eclectic Man Silver badge

      Re: Preliminary Root Cause

      When I was learning about cryptography, last century / millennium, I was informed that after getting a decent algorithm, the most difficult part was the key management. Sounds like this key management was quite complicated.

      1. Ken Moorhouse Silver badge

        Re: Sounds like this key management was quite complicated.

        The Data Centre janitor was cleaning the doormat and the key bounced into the nearby shrubbery.

        1. WhereAmI?
          FAIL

          Re: Sounds like this key management was quite complicated.

          Yes, well. I'm one of the 'can't say how many'. Not been able to log into Teams or Office 365 until about fifteen minutes ago.

          ' Signing in will give you a better experience'

          Okay... sign in

          ' We're sorry but there's been an error. Restarting Teams/Office 365 now'

          ' Signing in will give you a better experience'

          Okay... sign in

          ' We're sorry but there's been an error. Restarting Teams/Office 365 now'

          until finally

          'Your computer's trusted platform module has malfunctioned'

          Well, up yours. Can someone tell me just what was wrong with a good old DVD installer? I don't need or use half the crap that comes with Office 365 and I REALLY don't need that damned 'Sign into Teams now' splash screen to keep coming up when my org. has already auto-signed me in.

          Hey - left hand? What's the right hand up to today?

          1. MarkSitkowski

            Re: Sounds like this key management was quite complicated.

            Do you really need an answer to that last question...?
