With so many cloud services dependent on it, Azure Active Directory has become a single point of failure for Microsoft

Microsoft has fixed an issue with its OneDrive and SharePoint services where users were unable to sign in, caused by a faulty remediation for the earlier Azure Active Directory outage. "We're investigating an issue affecting access to multiple Microsoft 365 services. We're working to identify the full impact," said a Microsoft …

  1. Pascal Monett Silver badge

    "we will never be able to avoid outages entirely"

    No, you won't. Because I understand that cloud is complicated. The amount of data, the bandwidth requirements, along with the security requirements, I genuinely believe that the people who have imagined, planned, specced and built this are largely above-average in intelligence and competence.

    But, as I have said before and will not stop saying, when a company's local server falls, it only bothers the company and its customers. When The Cloud (TM) falls over, it impacts millions of people and businesses.

    It's okay though. We're still learning this computing thing. One day, we'll get the message : never build a single point of failure into your IT infrastructure.

    I don't know how that will pan out, but that's what we've got to do.

    1. MatthewSt Silver badge

      Re: "we will never be able to avoid outages entirely"

      Where do you draw the line though? To reduce single points of failure you need multiple but separate implementations of the same system (in the hope that the same bugs don't exist in the different implementations), running on a combination of different operating systems and hardware platforms. Take Jabber or email as an example. But that comes at the cost of slower improvements and increased development costs.

      I doubt Azure AD is a single point of failure in anything but name. It won't be one instance running somewhere, it won't be one service or deployment package. It's probably broad enough that it's the equivalent of saying "computers are a single point of failure, does Azure depend on them too much?". The incident report will probably be lacking in details, but maybe (hopefully) it will warrant a special explainer blog post like they used to do for some outages.

      Regarding how many businesses outages affect, for the most part that doesn't bother me. Either my business is affected or it isn't, and either the businesses (or customers) I'm communicating with are affected or they're not. In fact, in some regards it might be better that it affects multiple organisations at the same time: I don't have to look embarrassed and explain to a supplier that I've not received their email if they were unable to send it in the first place!

      1. Anonymous Coward
        Anonymous Coward

        Re: "we will never be able to avoid outages entirely"

        "I doubt Azure AD is a single point of failure in anything but name."

        Did you read the article? Almost all Office365 SaaS products rely on AzureAD. While AzureAD may not be a single computer, if AzureAD is unable to process login requests for any reason, new users are unable to access their SaaS applications and existing authenticated users have a default maximum of one hour before their session token expires.

        "Where do you draw the line though?"

        This comes down to the impact on your business. While it affected us globally, the main impact was the 5 hours during North American working hours. This isn't the first Azure outage in recent months and senior people are beginning to notice how Azure has issues while applications running on AWS are unaffected. While it's unlikely to alter decisions in 2020 when spending is tightly constrained, it is affecting planning for spending in 2021.

        1. neilo

          Re: "we will never be able to avoid outages entirely"

          "This isn't the first Azure outage in recent months and senior people are beginning to notice how Azure has issues while applications running on AWS are unaffected."

          It's not like AWS doesn't have issues. All manner of weird stuff stops working when AWS falls over.

          From talking with my clients, this whole "embrace the cloud" push is starting to lose steam. My locally-hosted email server falling over doesn't stop my locally-hosted ERP from running. My locally-hosted ERP falling over doesn't impact AD. AD falling over IS a problem... but having a backup AD controller solves that.

          "While it's unlikely to alter decisions in 2020 when spending is tightly constrained, it is affecting planning for spending in 2021."

          Oh yeah. And with one client alone that's a potential million dollars.

          1. Anonymous South African Coward Silver badge

            Re: "we will never be able to avoid outages entirely"

            ....but having a backup AD controller solves that.

            Any sysadmin worth his/her salt will make sure that there is a backup AD somewhere on the premises...

            1. Peter2 Silver badge

              Re: "we will never be able to avoid outages entirely"

              You only have one backup DC?

              Given that AD comes included, there's little practical reason not to have a VM serving as a DC on every physical server, if the physical server itself isn't the DC.

              1. Anonymous Coward
                Anonymous Coward

                Re: "we will never be able to avoid outages entirely"

                One DC per host is probably overkill, but one per cluster is not a bad idea, unless you have truly massive virtualization hosts with hundreds of VMs per host.

              2. neilo

                Re: "we will never be able to avoid outages entirely"

                Not me. My client was lulled into believing that AzureAD was the way to go. I stay the heck out of infrastructure, and stick with Dynamics 365.

            2. LDS Silver badge

              Re: "we will never be able to avoid outages entirely"

              You no longer have a "backup DC". All DCs are now peers (apart from the forest hierarchy, of course), although some FSMO roles are held by only a few of them. The idea is that multiple DCs make the system more resilient, and clients can contact the nearest DC.

              One issue is precisely ensuring that replication among DCs in different subnets, sites, etc. works flawlessly. If data aren't replicated correctly, problems follow. At the scale of Azure, I guess that becomes far harder than in a much simpler forest, especially while keeping response times low.
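The replication concern is easy to illustrate with a toy model. Below is a minimal last-writer-wins sketch of multi-master replication (a deliberate simplification: real AD resolves conflicts per attribute using USNs and version numbers, not bare wall-clock timestamps):

```python
class DC:
    """Toy multi-master directory replica: last-writer-wins per key."""

    def __init__(self):
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        self.data[key] = (ts, value)

    def replicate_from(self, other):
        # Merge: keep whichever write carries the higher timestamp.
        for key, (ts, value) in other.data.items():
            if key not in self.data or self.data[key][0] < ts:
                self.data[key] = (ts, value)


dc1, dc2 = DC(), DC()
dc1.write("user:alice", "enabled", ts=1)
dc2.write("user:alice", "disabled", ts=2)  # later write lands on another DC
dc1.replicate_from(dc2)
dc2.replicate_from(dc1)
# Both replicas converge on the later write.
print(dc1.data["user:alice"], dc2.data["user:alice"])
```

Until that replication pass runs, the two replicas serve different answers for the same object, which is exactly the kind of inconsistency that bites at Azure scale.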

        2. Maximum Delfango Bronze badge

          "Where do you draw the line though?"

          I draw the line so I am standing on one side of it, and Microsoft is on the other.

        3. MatthewSt Silver badge

          Re: "we will never be able to avoid outages entirely"

          "Did you read the article? Almost all Office365 SaaS products rely on AzureAD"

          I did indeed read the article. We've also configured our Azure AD to issue longer tokens, so none of our users were affected by this (access tokens may default to an hour, but can last up to a day). Yes, the services depend on Azure AD for login, but that doesn't make it a single point of failure any more than needing DNS or IP traffic does, and (as it's looking like this was) the main screw-ups you get there are when processes aren't followed.
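For anyone wanting to check what lifetime their tokens actually carry: the expiry is visible in the token itself. A minimal sketch in plain Python, not tied to any Azure AD API; the dummy token below is fabricated for illustration:

```python
import base64
import json


def token_lifetime_seconds(jwt: str) -> int:
    """Return the lifetime (exp - iat) encoded in a JWT's payload.

    Decodes the middle (payload) segment only; no signature check,
    so this is for inspection, not validation.
    """
    payload_b64 = jwt.split(".")[1]
    # JWT segments are base64url without padding; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["exp"] - claims["iat"]


# Build a dummy token whose payload claims a 24-hour lifetime.
header = base64.urlsafe_b64encode(b'{"alg":"none"}').rstrip(b"=").decode()
payload = base64.urlsafe_b64encode(
    json.dumps({"iat": 1600000000, "exp": 1600000000 + 86400}).encode()
).rstrip(b"=").decode()
dummy = f"{header}.{payload}."

print(token_lifetime_seconds(dummy))  # 86400
```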

        4. Anonymous Coward
          Anonymous Coward

          Re: "we will never be able to avoid outages entirely"

          The whole Office 365 thing seems a bit hacky in general.

          NoScript often complains about XSS, which is not a good thing (I trust Giorgio Maone to know a lot more about web security than Micros~1), eg:

          "NoScript detected a potential Cross-Site Scripting attack from https://outlook.office365.com to https://login.microsoftonline.com."

          ...and for some of the web apps you have to loosen your cookie settings from "Block all third-party cookies" (which you'd expect any non-spyware website to work quite happily with) to "Block cookies from unvisited websites" (I'm not really sure what the exact difference between an unvisited website versus a third-party one actually is, but why should I need cookies from either?), or they don't seem to work.

      2. oldcoder

        Re: "we will never be able to avoid outages entirely"

        "I doubt Azure AD is a single point of failure in anything but name."

        Um. It can easily be a single point of failure even with multiple servers: when one server sends garbage to the others, they all fall down.

        1. Flywheel Silver badge

          Re: "we will never be able to avoid outages entirely"

          And presumably if an update fails or doesn't work as intended, that'll cause the same problem. I'd like to think they'd test it first though...

          1. DJV Silver badge

            test it?

            Of course it will be tested!! Just not by Microsoft, though...

      3. Doctor Syntax Silver badge

        Re: "we will never be able to avoid outages entirely"

        "I don't have to look embarassed and explain to a supplier that I've not received their email if they were unable to send it in the first place!"

        You're relying on the supplier and yourself having the same unreliable email provider and, further, on the supplier having just one email provider without any fallback. Even as a private individual I have my own domain email but also an ancient Hotmail address. Mail from the contact page of my website that comes back to me is duplicated to Gmail and Outlook addresses.

      4. Roland6 Silver badge

        Re: "we will never be able to avoid outages entirely"

        >Where do you draw the line though?

        <digression>

        Suspect having a failover cloud on Mars is going a bit far, but it might come in handy when that asteroid hits...

        </digression>

    2. Stuart Castle Silver badge

      Re: "we will never be able to avoid outages entirely"

      I haven't had much experience with Azure Active Directory, but any cloud provider can have trouble. As a protection against this, generally, where possible, it's best to have multiple providers. I don't see how this can be done with Azure Active Directory.

      1. Anonymous Coward
        Anonymous Coward

        Re: "we will never be able to avoid outages entirely"

        I don't see how this can be done with Azure Active Directory.

        By running a DC yourself hosted somewhere else.

  2. Jay Lenovo
    Trollface

    Look away.. baby, look away

    But if you ignore this choke point, it's nearly infallible.

    (..and onto another day of ignoring)

  3. Anonymous Coward
    Anonymous Coward

    Latest service update has 3 separate fail points

    From their advisory email:

    "We have identified the preliminary root cause and the extended impact as a combination of three separate and unrelated issues.

    * A code defect in a service update.

    * A tooling error in the Azure AD safe deployment system that impacted regional scoping.

    * A code defect in Azure AD’s rollback mechanism, resulting in a delay in reverting the service update."

    In my view, the second seems most serious - a regional update possibly "escaping" into the world. The other two were multiplied X-fold by that.

    Obviously, (1) and (3) are pretty serious - it means their sandbox/QA environment wasn't up to snuff, and that their rollback testing was inadequate. For such a critical component I would imagine some folks are getting 'a blowtorch to the belly'.

    AC for fairly obvious reasons.

    1. oldcoder

      Re: Latest service update has 3 separate fail points

      "...their rollback testing was inadequate..."

      their rollback testing was inadequate

      Microsoft has a testing section? naaa. They laid them off quite a few years ago.

  4. Anonymous Coward
    Anonymous Coward

    Only an idiot runs Windows on a server.

    1. Phil Kingston
      1. oldcoder

        truth hurts.

        1. Peter2 Silver badge

          Allow me to correct that; Only somebody with a paid job in IT runs windows on a server.

          And we tend to run it because the company picks a software solution and says "we want to run this because it's going to save us X". Running that bit of software then becomes the requirement. It runs on Windows Server, because no commercial software is written for anything else, so we buy Windows Server to run it.

          If I turned around to the management and said "running Windows is uncool, so we run Y instead and can't run the productivity software you want; you'll have to do it manually", then I'd be fired and my replacement would be doing what the management want to do, which is to make money.

          And Windows Server can be set up to run in a fairly stable manner, with a sensible architecture, if somebody competent does it. If somebody incompetent manages it, then it's going to be set up insecurely and unreliably, with no backups or failovers. Just the same as any other type of server.

          1. DCFusor Silver badge

            "No commercial software is written for anything else" is less and less true by the day. You must have a particular setup in mind. Do you work for one of those companies that only provides a windows version of their code? Could you name examples?

            Even the hated Oracle supports Linux these days. MS is coming along, and many minor players have been there quite a while.

  5. neilo

    This was noticed by potential customers

    I have a client running (mostly happily) Microsoft Dynamics 2012 R2. They're considering a Dynamics 365 migration (the cloud ERP beastie), but yesterday, for the bulk of their working day, email, Teams, OneDrive, Office 365, Azure AD and some Azure DNS stuff were offline because of this outage.

    It hasn't killed talk of a D365 migration dead, but it's certainly on ice now. They are buying some local servers to allow for AD authentication and DNS now, and bringing Exchange back in house isn't out of the question.

    Microsoft will survive this outage, and my client will get back on the D365 bandwagon. But they are a whole bunch more skeptical now than the day before.

    1. aregross

      Re: This was noticed by potential customers

      Notice all the M$ sales this will generate.....!

    2. Anonymous Coward
      Anonymous Coward

      Re: This was noticed by potential customers

      I had to 'bing' what Microsoft Dynamics is. Still none the wiser. Presumably when enough people use whatever it is for whatever is does, MS will drop it.

      1. jglathe

        Re: This was noticed by potential customers

        Well, IMO, they already did but didn't tell anyone. Dynamics is ERP, but it's really three (or five?) ERP products not quite merged into one. Microsoft bought Navision A/S and some others in 2001. As of now it's cloud-only (okay, you can get on-prem), and web client only.

        1. neilo

          Re: This was noticed by potential customers

          Dynamics 365 is a mess, in terms of how the various technologies are being welded together. Once the factions within Microsoft can agree on a common data structure things will improve.

          On-prem Dynamics 365 is possible... but you don't want to do it. Literally the first line in the on-prem plan is "implement Azure in your datacenter". And even assuming you have the horsepower to do that AND get D365 up and running, you *still* need Azure Service Bus to make things work, and you *still* need Microsoft-hosted Azure dev systems. So you can go down the expensive Azure approach, or down the monumentally expensive on-prem approach.

  6. Mike 137 Silver badge

    "a single point of failure for Microsoft"

    Rather more importantly, it's a single point of failure for all its customers, potentially worldwide. Have we forgotten the Leap Day debacle that took the entire Azure service down because some individual developer decided to do their own thing when calculating dates?

  7. thondwe

    But meanwhile On Premises

    AD running on premises - but there's a zero-day, so patch, patch, patch - oh dear, a bug - oh dear, AD broke for ALL sites that applied the urgent patch...

    Systems have bugs. Resilient systems have more parts, so more bugs.

    Cloud means paying someone else to look after all those problems - 24x7 support, buying and provisioning tin, data centres, resilient heating, cooling and power - so you can get on with running your business. You pays your money and you assess your benefits and risks!

    1. Doctor Syntax Silver badge

      Re: But meanwhile On Premises

      In-house IT: how large does the business that directly pays their wages look to your in-house staff?

      Cloud IT: how large does your business, which indirectly pays an indiscernibly small part of their wages, look to Microsoft (or any other vendor's) staff?

  8. Maximum Delfango Bronze badge
    Boffin

    How it will all work out.

    In around five years' time 'the cloud' will be deemed too expensive, too slow and too likely to fail; we'll all switch from public clouds to private on-premises 'clouds'; we'll need a name for this collection of hardware used exclusively by an organisation so it can more safely and ably manage its risk profile and its expenses. I suggest "data centre", but I expect others can come up with something better.

    1. Anonymous South African Coward Silver badge

      Re: How it will all work out.

      Skynet or Legion.

      TBH I prefer Skynet.

    2. securityfiend

      Re: How it will all work out.

      I prefer "Mainframe"

    3. RichardBarrell

      Re: How it will all work out.

      Ah! "What will we call it if everybody goes back to on-prem" is an interesting question. Think about the attributes here:

      - it's the same kind of stuff: you still have storage, compute, managed platforms

      - it's also going to still be fluid, because you'll want to be able to shift workload around the entire fleet of boxes to stay cost-competitive

      - you'll know exactly where it is at any time; the physical location will be much less nebulous

      So. Same substance as a cloud, still fluid, location is known.

      I think we should call it a "puddle".

    4. DavidRa

      Re: How it will all work out.

      I believe the draft name is "Edge Computing" - locating the compute close to the users and data.

      What a concept.

      1. Anonymous Coward
        Anonymous Coward

        Re: How it will all work out.

        Fog. It’s called fog. It’s like a cloud, but it is local and it is low to the ground.

    5. LDS Silver badge
      Joke

      I suggest "data centre"

      You don't work in marketing, right?

      They will just choose another name that actually means nothing, but gives PHBs and above a positive feeling, just like "cloud".

      1. John70

        Re: I suggest "data centre"

        It's got to have at least "digital" in the name... Marketing will love that.

        1. ItWasn'tMe

          Re: I suggest "data centre"

          "Digital fog" it is then..

    6. Cliffwilliams44

      Re: How it will all work out.

      And the internet is a fad and we will all go back to snail mail and portable storage. Yeah, sure.

  9. Anonymous Coward
    Anonymous Coward

    I work for a global company that's in the middle of moving everything to Azure O365 (and all the rest). While I agree it's good from a security perspective with MFA, it's hard to accept handing control of access over to a vendor. If we as the IT department screw up or have an issue, it's on us. When a London data centre catches fire or something like this happens, we're powerless to fix it.

    Very hard to explain to the business we can't do anything about the problem.

    1. Cliffwilliams44

      From my perspective, it is easier on us. Management knows "we cannot fix the cloud", so everyone just "chilled out" and waited for it to be fixed. They are also much happier that they don't see me every six months asking for $100K to expand the storage systems and backups because they refuse to deal with email and data retention.

  10. Anonymous South African Coward Silver badge

    There are some things I'll put in the Cloud - Teams, file hosting, backup storage (properly secured, mind), etc.

    And then there's some things I won't put in the Cloud - AD authentication, mission-critical files, files containing serious company IP, that sort of thing.

    Look, it is nice to plonk lots of files on a server and not have to worry about the underlying hardware, for sure. But you should worry about the underlying hardware too - what if it is susceptible to a nasty backdoor and world+dog is able to sniff at your precious private data?

  11. Doogie Howser MD

    There's no perfect solution to this, and there never will be; we live in an imperfect world. Where there is networking, storage, compute and a whole gamut of plumbing from different vendors making it all work, something will break at some stage.

    That being said, I remember the last fairly major Azure AD meltdown, and it turned out that the bulk of requests went to a Texas DC which fell over. Since then, Microsoft claim to have improved this, but remember that under the hood AAD is nothing more than a custom build of ADAM. It's not the same as "conventional" AD, and so the usual rules don't apply.

    Some people prefer on prem, and that's fine. Some people prefer cloud, that's fine. Pick the appropriate tool for the job, don't just follow dogma.

  12. Peter2 Silver badge

    "We acknowledge the unfortunate reality that – given the scale of our operations and the pace of change – we will never be able to avoid outages entirely"

    If you can't avoid outages given the scale of your operations and the pace of change then either reduce the scale of operations or the pace of change.

    And as for scale, my experience is that avoiding outages gets easier at larger scales, as you can afford backups etc., so the problem with Azure would appear to be poor operational management, which to be fair is a Microsoft specialty. Probably because they follow practices they try to force on other people, like "simultaneously patch everything with no testing" rather than "patch a canary group and see if anything snuffs it".
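The "canary group" approach is simple enough to sketch in a few lines (the apply_patch and health_check hooks here are hypothetical stand-ins, not Microsoft's actual deployment tooling):

```python
def rollout(servers, apply_patch, health_check, canary_fraction=0.05):
    """Patch a small canary group first; stop before touching the
    rest of the fleet if any canary fails its health check."""
    n_canary = max(1, int(len(servers) * canary_fraction))
    canaries, rest = servers[:n_canary], servers[n_canary:]
    for s in canaries:
        apply_patch(s)
    if not all(health_check(s) for s in canaries):
        return False  # abort: only the canaries are affected
    for s in rest:
        apply_patch(s)
    return True


# Demo: a patch that breaks every server it touches.
patched = []
ok = rollout(list(range(100)),
             apply_patch=patched.append,
             health_check=lambda s: False)  # canaries report unhealthy
print(ok, len(patched))  # False 5 -> the other 95 servers were never touched
```

The bad patch is contained to the canaries instead of the whole fleet, which is the entire point of the practice.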

    To this day, I'm not sure why my internal uptime is better than that of Microsoft's cloud, given that we both nominally run on the same technology and I am extremely staff- and resource-constrained while they have practically unlimited amounts of both.

    When it comes to "the pace of change is too fast", most of us deal with this by applying an old proverb: "If it ain't broke, don't fix it." Design something that is stable and does what people want, and then leave it alone other than fixing bugs that make it less stable. If somebody comes along demanding radical new functionality for some random edge case then, rather than breaking something that works for everybody, design a separate service for that edge case that works independently, so if it dies, it dies - not the entire worldwide operations of everybody using anything connected to the Microsoft cloud.

  13. Unicornpiss Silver badge
    Meh

    I miss the days when things were local

    I realize the trend is for everything to be 'cloud' based, and it makes sense for a lot of reasons, but in a way it's like manufacturing companies outsourcing their labor and sub components. 'They' don't know your business, and when something goes wrong from afar, it can really gnarl up your business, and no matter how smart your people are, it is utterly beyond your control when you put all your eggs in someone else's baskets. Besides Microsoft, look at the outages Adobe has had.

    A little off topic, but in the early days of Covid it almost looked like there was a real possibility of some sort of apocalyptic decline. Everything is more and more dependent on services rendered offsite. It wouldn't take a lot of infrastructure disruption in the right places to bring the whole world's communications, education, business, travel, and even emergency services to their knees. And while things are a lot more robust than they were even a year ago, the trend is disturbing. It's a bit trivial, but a lot of people don't even have local copies of their music or pictures anymore, among other things. If someone pulls the plug, all that is inaccessible until things normalize enough for repairs. Even most phone calls, cellular or 'land line', make at least part of their journey through the Internet. I could easily envision a scenario where we're begging radio hams to get a message through to Aunt Sadie if things really go south.

    Where's the tinfoil hat icon?

    1. Anonymous Coward
      Anonymous Coward

      This is a Local shop for Local People

      Isn't what you describe the same all over?

      You used to buy your food from someone in your village who grew it. Now you buy it from a supermarket in the nearest town and they are supplied from a central distribution centre.... what if that caught fire?

      Your policeman used to live in the village, now he's not even in the nearest town, he's in a big police station somewhere "central".... what if that blew up?

      All of this, cost saving, centralisation. All works well until it doesn't.

      1. Dave559 Bronze badge

        Re: This is a Local shop for Local People

        Yes, but if one supermarket chain has a major problem, you can easily shop in another one.

        Not so easy for data if all your data is locked up in one company's infrastructure, and it has a bad day/week. Yes, you could possibly also use another as a live backup (if you can convince someone that it would be worth the expenditure), but I imagine it would be quite some headache trying to hot swap over from one to the other!

  14. This post has been deleted by its author

  15. ecarlseen

    Not all downtime is equal.

    With private cloud services, we balance the risk of maintenance operations against their impact on business operations, and schedule accordingly. The impact of a massive system failure at 2AM local time on a Sunday morning is not the same as at 10AM on a Monday morning. This does not mean nothing ever goes wrong, but it means that we tilt the odds as far as we can in our favor, and for the most part it works out very well.
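That risk-balancing can be made concrete: pick a low-impact window in the site's local time and only allow maintenance inside it. A sketch, with window hours and time zones chosen purely for illustration (requires Python 3.9+ for zoneinfo):

```python
from datetime import datetime
from zoneinfo import ZoneInfo


def in_maintenance_window(utc_now: datetime, tz: str,
                          start_hour: int = 1, end_hour: int = 5) -> bool:
    """True if the given UTC instant falls inside the low-impact local
    window (default 01:00-05:00 on a Sunday) for the site's time zone."""
    local = utc_now.astimezone(ZoneInfo(tz))
    # weekday() == 6 is Sunday; zoneinfo handles DST shifts for us.
    return start_hour <= local.hour < end_hour and local.weekday() == 6


# 2020-10-04 02:00 UTC was a Sunday.
sunday_2am_utc = datetime(2020, 10, 4, 2, 0, tzinfo=ZoneInfo("UTC"))
print(in_maintenance_window(sunday_2am_utc, "Europe/London"))    # 03:00 BST Sunday
print(in_maintenance_window(sunday_2am_utc, "America/Chicago"))  # 21:00 CDT Saturday
```

The same instant is a quiet Sunday small-hours slot in one region and prime Saturday evening in another, which is exactly why a single global maintenance schedule cannot suit everyone.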

    With public cloud services, every site is serving customers in every time zone, and maintenance operations are performed at any time of the day or night (relative to us) with precisely zero consideration of specific customer impact - and there is precisely nothing you can do about it.


Biting the hand that feeds IT © 1998–2020