AWS outage exposes Achilles heel: central control plane

Amazon's US-EAST-1 region outage caused widespread chaos, taking websites and services offline even in Europe and raising some difficult questions. After all, cloud operations are supposed to have some built-in resiliency, right? The problems began just after midnight US Pacific Time today when Amazon Web Services (AWS) …

  1. JimmyPage Silver badge
    Facepalm

    It's DNS. No Surprise.

    It's always DNS.

    1. Michael

      Re: It's DNS. No Surprise.

      Apart from when it's BGP. But it's normally DNS.

    2. cookiecutter Silver badge

      Re: It's DNS. No Surprise.

      unless it's dave from accounts clicking the random link

    3. Roland6 Silver badge

      Re: It's DNS. No Surprise.

      Going off on a bit of a tangent: no DNS means engineers having to type in IP addresses. That wasn't too difficult in a structured private IPv4 environment, but in the loosely structured world of IPv6 (needed for cloud), where people have hand-waved at address allocation and let the machines auto-assign, it is a much more challenging task. I would not be surprised if a contributor to the recovery delay was people having to locate the individual IP addresses of systems they couldn't look up.
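
      A minimal sketch of that fallback idea, assuming an out-of-band inventory is kept somewhere DNS-independent; the host names and IPv6 addresses below are invented:

      ```python
      # Minimal sketch: resolve via DNS while it answers, fall back to an
      # out-of-band inventory when it does not. Host names and addresses
      # are invented for illustration.
      import socket

      # Maintained out-of-band (config-management export, printed runbook, etc.)
      FALLBACK_INVENTORY = {
          "auth-1.internal.example": "2001:db8:10::5",
          "db-1.internal.example": "2001:db8:20::17",
      }

      def resolve_v6(host: str) -> str:
          try:
              # Prefer a live AAAA lookup while DNS is still answering.
              info = socket.getaddrinfo(host, None, socket.AF_INET6)
              return info[0][4][0]
          except socket.gaierror:
              # DNS is down or the name is missing: use the last-known-good map.
              return FALLBACK_INVENTORY[host]

      print(resolve_v6("auth-1.internal.example"))
      ```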

  2. ComicalEngineer Silver badge

    The fragility of the system

    Modern computer systems are so fragile that a single line of code can bring down the house of cards.

    It's a sad indictment of the state of our computer systems (and industry) that the whole lot can go mammaries skywards so easily.

    It's long past time that companies started getting more active about making more resilient software.

    1. Throatwarbler Mangrove Silver badge
      Devil

      Re: The fragility of the system

      In fairness, that's always been the case. A wayward line of code in an OS kernel could certainly bring your system to a screeching halt, no matter how resilient the rest of it is. The main things which have changed are that so many distributed systems are dependent on external resources and that, in the case of AWS users, those systems are apparently dependent on a single point of failure. Resolving these problems is a trivial task and thus left as an exercise for the reader.

    2. Throg

      Re: The fragility of the system

      One could argue that it's actually a sign of the complexity of those systems.

      After all, something simple like rupturing a fuel line in a car can cause all sorts of problems. Is this any different?

      I'm not excusing poor design and deployment, and heaven knows we see far too much of that especially now we have "professional managers" running the patch, but engineering of all kinds is all about finding the balance between cost and the acceptable likelihood of failure. Let's hope that the relevant parties here will reassess that based on today's incident.

      1. Neil Barnes Silver badge

        Re: The fragility of the system

        But perhaps the metaphor is a single fuel line rupturing that stops _all_ cars?

        I have always been sceptical about the 'somebody else's computer' model, though (along with a firm feeling that lots and lots of data is being processed that simply doesn't need to be).

        1. The Mole

          Re: The fragility of the system

          A ruptured fuel line on a motorway will stop all cars on the motorway so the analogy seems accurate to me.

          Of course it's only the road closing ruptured fuel line you hear about, not all the other ones that have no impact...

  3. Doctor Syntax Silver badge

    From the Wikipedia article on DNS: This mechanism provides distributed and fault-tolerant service and was designed to avoid a single large central database. (my emphasis)

    Somebody forgot the "distributed" part and that fault tolerance depends on it.
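
    In practice the "distributed" part means multiple resolvers and authoritative servers. A minimal sketch, assuming dnspython 2.x is installed, of trying several resolvers in turn (which, as the reply below notes, is no help when the data they all serve is wrong); the resolver addresses and the queried name are examples only:

    ```python
    # Minimal sketch: try several independent resolvers before giving up.
    # Assumes dnspython 2.x; resolver IPs and the queried name are examples.
    import dns.exception
    import dns.resolver

    RESOLVERS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]

    def lookup(name, rdtype="A"):
        last_error = None
        for server in RESOLVERS:
            resolver = dns.resolver.Resolver(configure=False)
            resolver.nameservers = [server]
            try:
                answer = resolver.resolve(name, rdtype, lifetime=2.0)
                return [rr.to_text() for rr in answer]
            except dns.exception.DNSException as exc:
                last_error = exc  # this resolver failed; try the next one
        raise last_error

    print(lookup("example.com"))
    ```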

    1. Throg

      It sounds like the DNS service was robust. It was robustly pointing things at the wrong place...

  4. JimmyPage Silver badge
    Megaphone

    This isn't complexity.

    It's penny pinching.

    Somewhere I can guarantee there is a component that is a single point of failure that wasn't protected as it would have cost too much.

    By definition, if you have redundancy, you have inefficiency. By all means choose efficiency over resilience, but for the love of god, own it when things go wrong.

    1. elsergiovolador Silver badge

      Re: This isn't complexity.

      There is also a lack of care. Due to vendor lock-in, what are those businesses and government departments going to do?

    2. Notas Badoff

      Re: This isn't complexity.

      Still tickled by an old posting...

      Abend's Observation: "Many cloud systems are actually just distributed single points of failure"

      And here the distributed single point of failure that all were referencing was AWS?

    3. DS999 Silver badge

      Re: This isn't complexity.

      wasn't protected as it would have cost too much

      Nope, not a company like Amazon. It is because:

      1) no one realized it was and always has been a single point

      2) originally it was protected, but other changes caused it to become a single point without anyone realizing

      3) they know it is a single point and there is a huge project underway to address that but it isn't complete yet

      4) they know it is a single point but they can't address that without basically throwing out their entire design and starting over from scratch

      1. Timop

        Re: This isn't complexity.

        Why would no-one have been able to realise that there was a single point of failure?

        Like there would not be capable people able to identify it if they were assigned to investigate.

        In places where there are KPIs, the risk of people just not noticing something extremely important that lies outside the KPIs is pretty obvious. And that is exactly what the management is telling everyone to do.

        1. Yes Me

          Re: This isn't complexity.

          "Why would no-one been able to realise that there was single point of failure?"

          There is always a single point of failure in a complex system*. The trick is finding it.

          *Yes, there is, really, however much redundancy is provided. That's a theorem.

          1. JimmyPage Silver badge
            Boffin

            Re: This isn't complexity.

            Max flow/min cut. Ford Fulkerson and all that

            Yes it was 40 years ago at Uni, but I listened
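
            For anyone who wants the max-flow/min-cut point made concrete, a minimal sketch assuming networkx is installed; the toy topology and capacities are invented, not Amazon's:

            ```python
            # Toy illustration of max-flow/min-cut: model the service as a
            # capacitated graph and ask which edges disconnect source from sink.
            import networkx as nx

            G = nx.DiGraph()
            # Two "redundant" web tiers that both funnel through one control-plane node.
            G.add_edge("client", "web-a", capacity=10)
            G.add_edge("client", "web-b", capacity=10)
            G.add_edge("web-a", "control-plane", capacity=10)
            G.add_edge("web-b", "control-plane", capacity=10)
            G.add_edge("control-plane", "database", capacity=10)

            cut_value, (reachable, unreachable) = nx.minimum_cut(G, "client", "database")
            cut_edges = [(u, v) for u in reachable for v in G[u] if v in unreachable]

            print(cut_value)  # 10 -- the lone control-plane link caps the whole flow
            print(cut_edges)  # [('control-plane', 'database')] -- the de facto SPOF
            ```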

          2. DiggerDave

            Re: This isn't complexity.

            Repeat after me: Repetition is not Redundancy

        2. DS999 Silver badge

          Re: This isn't complexity.

          Why would no-one have been able to realise that there was a single point of failure?

          Do you have no conception of how complex Amazon's environment is? There is no one person who understands it all. There will be people who understand it all from a high level but not down to how every tiny subsystem acts and what it interfaces with, and people who understand a piece of it down to the tiny details but the rest only at a high level. There could be a piece that's got redundancy via its interactions with other subsystems but then those various subsystems were upgraded/enhanced/bugfixed over the years and no one noticed the redundancy they provided no longer exists.

          Even if they know there is a single point of failure (and I say 'a' as if there's only one, which there may not be) that's unfixed - because fixing it is a long project not yet complete, or would be impractical due to the scale of the "fix" - until it fails they may not have realized how much shrapnel its failure sends through the overall system. As a result, their "this is what you do if SPOF 'x' fails" cookbook, never previously tested for obvious reasons, proves to have overlooked a few things which they have to figure out on the fly.

      2. Claptrap314 Silver badge

        Re: This isn't complexity.

        My observation is that Amazon did not have the right people in place to build a resilient system for a number of years. Resilience has been an afterthought, and basically has been wrapped & bolted on after the fact. The fact that IAM requires US-E to be up in order to function properly is a massive joke on the resilience front, for instance. The fact that they were configuring S3 from S3 without having the golden dataset elsewhere until that blew up is another demonstration of this issue.

        Of course, this was BEFORE Bezos went all Jim Jones with this FakeAI nonsense. I would not be surprised for a minute if letting the bodies hit the floor was not the true root cause.

    4. Anonymous Coward
      Anonymous Coward

      Re: This isn't complexity.

      > there is a component that is a single point of failure that wasn't protected as it would have cost too much.

      Quite likely. And just as likely that spof is a lone (overworked, unappreciated, unsupported) engineer survivor/hostage in a formerly-robust group of experienced professionals.

      As the former co-workers left, so did their skills and tribal knowledge.

      Wouldn't be that surprised if the last engineer is in Nebraska ....

  5. elsergiovolador Silver badge

    HMRC

    and even government services such as tax agency HMRC were impacted.

    Surely HMRC shouldn't have anything stored in US-EAST-1, no? And if they had, I am sure the government will promptly launch an investigation?

    (Setting aside the whole omnishambles of using AWS at all - due to the Cloud Act, taxpayer data are not safe.)

    1. Altrux

      Re: HMRC

      They may well have no /data/ stored there - it should all be in the UK region (eu-west-2). But as the article explains, it turns out that the entire global AWS cloud still has critical dependencies on its 'mothership' region, the original hub in N Virginia.

      1. elsergiovolador Silver badge

        Re: HMRC

        As far as I know, the UK isn’t a dependency of the US.

        This isn’t just an inconvenience - it’s a sovereignty issue. It’s absurd that critical UK government systems can go down because something broke in Virginia. Whether or not data is physically stored there is irrelevant: AWS is bound by the US Cloud Act, and no amount of “UK region” branding changes that.

        1. elsergiovolador Silver badge

          Re: HMRC

          To clarify - governments are not exempt from the US Cloud Act, so placing data in eu-west-2 is just a coping mechanism. It has nothing to do with ensuring the safety of data about us. This whole procurement should have been audited.

          1. cookiecutter Silver badge

            Re: HMRC

            Government should not be using any US hyperscaler. It's laziness from Crown Commercial and maybe brown envelopes.

            UK government infrastructure & data should be on UK owned infrastructure

        2. MatthewSt Silver badge

          Re: HMRC

          > it’s a sovereignty issue

          We've exercised our sovereignty and chosen to store our data using a US provider.

          Maybe ask your MP why they're not using https://crownhostingdc.co.uk/

          1. elsergiovolador Silver badge

            Re: HMRC

            I wouldn't be so sure that this is a British entity.

            https://www.datacenterdynamics.com/en/news/ark-owner-elliott-investment-looks-to-sell-uk-data-center-firm-report/

            The US Cloud Act likely applies there as well.

            1. rg287 Silver badge

              Re: HMRC

              I wouldn't be so sure that this is a British entity.

              ...

              The US Cloud Act likely applies there as well.

              Crown Hosting is a UK business of which all the directors are British nationals. It is owned by Ark Data Centres Ltd and the Cabinet Office. Ark is also a UK Ltd, and all the directors are British nationals.

              The fact that Ark is in turn owned by an American somewhere up the tree doesn't mean that the American owner can make them ship data out to the US. I mean, the US might try, and they might even imprison the US owner. But they have no direct hold over the UK directors, who would say "Nah, sorry. It would be a crime in our jurisdiction to send you that data and we're not going to prison for you".

              It's a UK based/managed/engineered firm that happens to have some American money in it somewhere up the chain.

              The ability of a US financier to dip into arbitrary servers in a UK datacentre is quite different (and more limited!) than for AWS, where the European regions are run by a US-based engineering effort and the US engineers - under duress - can in fact dip into overseas regions at will, despite the fact that AWS UK is merely a branch of AWS EMEA SARL in Luxembourg, and is registered as such with Companies House, declaring themselves under Luxembourg law, with a bunch of Luxembourg-based directors.

              1. skwdenyer

                Re: HMRC

                If the US owner has the power to appoint / fire Directors of the UK entity, they almost certainly have the power to exercise direct control *despite* the objections of current UK Directors.

                1. rg287 Silver badge

                  Re: HMRC

                  If the US owner has the power to appoint / fire Directors of the UK entity, they almost certainly have the power to exercise direct control *despite* the objections of current UK Directors.

                  Well yes. They can say "do this or you're fired".

                  However, if the alternative is a criminal conviction, then most people would walk - directors are not necessarily employees, so wouldn't be able to sue for unlawful dismissal. But realistically, the engineers setting up access (on orders from domestic or parent directors) will be. They have a lawful duty to say "UK law says go swivel" and if they're sacked for that, then they'll win a year's salary at the employment tribunal because refusing to break UK law is automatically going to be unfair dismissal.

        3. Anonymous Coward
          Anonymous Coward

          Re: HMRC

          >> the entire global AWS cloud still has critical dependencies on its 'mothership' region

          > As far as I know, the UK isn’t a dependency of the US

          Grabbing the wrong end of the stick there

          1. elsergiovolador Silver badge

            Re: HMRC

            Well, at least not officially...

    2. Anonymous Coward
      Anonymous Coward

      Re: HMRC

      HMRC doesn't store any data outside of UK data centres. The issue here was that Government Gateway was affected by the global ripples of the US-EAST problem so taxpayers couldn't authenticate and therefore couldn't access HMRC

  6. Tron Silver badge

    DIY or rely.

    Terms and conditions. Vol. 4, p.632, Item 23:

    If AWS staff are trying their best, no compensation is payable.

    The way we used to do tech, on prem, was much more reliable. Even if someone nuked the East coast of America, your server room would keep whirring, and your stuff would just work.

    All the stuff GAFA flog us to ensure themselves a healthy income - SaaS, cloud storage and AI - makes our tech inherently less resilient and forces us to pay regular fees to GAFA to exist.

    You'd like to think people would change after this. I'm sure 'lessons will be learned' but nobody will actually do anything differently. They will choose the lazy option and just keep paying the subs.

    If you want to depend on a third party for your music, pay for streaming. If not, buy CDs. Ditto for enterprise level tech. DIY or rely. Personally, I still buy CDs.

    1. elsergiovolador Silver badge

      Re: DIY or rely.

      The way we used to do tech,

      Ever been to an AWS sales meeting with a client?

      The old-school on-prem engineers just don’t have the same woo and bullshitting skills.

  7. Lawless-d

    Nice when you can directly blame.

    We had Windows Virtual Desktops going "poof" after 20 mins.

    Was it our Cisco VPN using AWS for logging?

    Was it VMWare phoning home and having no cloud db?

    Was it some other logging timing out?

    What are we going to do? Run tcpdump on our pipe, log all the IPs and work out if they are in AWS? (Well, we will now.)

    If our intranet was down all morning we'd be in the boardroom explaining.

    Happily we can now just say "AWS was down" and everybody shrugs.

    1. chivo243 Silver badge
      Trollface

      Re: Nice when you can directly blame.

      Happily we can now just say "AWS was down" and everybody shrugs.

      I used to say: Happily we can now just say "It's a MS product" and everybody shrugs.

  8. Dan 55 Silver badge
    Pirate

    Amazon's US-EAST-1 region outage caused widespread chaos, taking websites and services offline even in Europe and raising some difficult questions.

    Yes, you're not wrong. So much for how seriously big business takes GDPR.

    1. elsergiovolador Silver badge

      GDPR was created for big business to legitimise the personal data trade under the guise of protecting privacy (classic gaslighting).

      Most users don't read what they consent to (courtesy of the Cookie Law, which trained them to click Agree to anything).

      1. Dan 55 Silver badge

        GDPR != Cookie law.

        1. elsergiovolador Silver badge

          You missed my point. The Cookie Law wasn’t about privacy - it was behavioural conditioning. It trained people to click “Agree” without reading anything.

          GDPR was the next step: once users were preconditioned, it gave corporations a legal framework to collect and trade data with subject consent. Before GDPR, that kind of large-scale data trading was legally murky.

          1. Dan 55 Silver badge

            Well it trained me to click disagree without reading anything.

            1. elsergiovolador Silver badge

              For one person like you, there will be 9999 who won't even care what they clicked.

  9. Anonymous Coward
    Anonymous Coward

    Ho-Hum here we go again ... again ... again ... (Error Recursion level too high !!!)

    Multiple levels of 'Someone else’s computer/system/service' relying on each other !!!

    Each level is configured to be 'Good enough' as being 'better than good enough' costs more and needs more 'expensive' labour.

    Everyone ignores the obvious risks because it 'probably won't happen' ... when it does it is 'someone else’s fault' !!!

    Rinse, repeat and 'make bank' as the Americans say, supposedly !!!

    P.S.

    Don't ever learn any lessons ... because they cost money too.

    Also the original C-level architect(s) of these disasters, who need to learn, are usually long gone, setting in place the next disaster to come ('AI' perhaps !!!???)

    :)

    1. Cris E

      Re: Ho-Hum here we go again ... again ... again ... (Error Recursion level too high !!!)

      "Their guy HAS TO be better than our guy. Jeff's an idiot and it's their business so they MUST have hired someone better than him, maybe two someones."

  10. Richard Tobin

    So what was the problem?

    Yes, it was DNS. But that could mean anything. Did a server fail - in which case their redundancy is hopeless - or did someone just put the wrong data in - in which case they need to improve their procedures - or what?

    1. steeltape

      Re: So what was the problem?

      Exactly.

      "DNS failure" is not a root cause. What caused the failure?

      Did a queue fill up, causing a queue writer to wait too long for an entry to become available?

      Did a program fail to detect an error, e.g. a buffer over-run, and crash?

      Did an IO error load a bad address into a pointer?

      Did a power failure gum up a transaction?

  11. kmorwath

    So the mighty AWS still has a single point of failure...

    ... in its original implementation, because nobody bothered to replicate it across other regions - maybe because having authentication management close to Langley is a bonus? Or is it just the proverbial sysadmin laziness? "It works [most of the time], don't touch it...."

  12. herberts ghost

    This points to the fundamental issue with cloud computing and "centralization"

    Cloud computing only increases your security attack surface. Think of a wall of brains ... it only takes a cloud vendor's Igor to mess things up. You are still exposed to your own errors and have added the vendor's (even though they are generally very good). Another risk, though not in play today, is vendor lock-in: don't use proprietary tools unless you have to, and if you must ... do not use the cloud vendor's, or you have made them a monopoly.

    Centralization is another issue, because AWS is so large that the likelihood that the sites you depend on use AWS is high. So if AWS gets sick, it affects a lot of sites and services at once, and the true cause may be hard to deduce.

    1. Graham Cobb

      Re: This points to the fundamental issue with cloud computing and "centralization"

      Yes.

      Smart people know that - and still use AWS because the tradeoff is worth it. After all, if your own datacenter has no redundancy, AWS can be a major step up even though it has this failure mode.

      It is unlikely to be worthwhile for most commercial enterprises to get "better than AWS" reliability. Which is fine.

      What is not fine is that the people who do need to keep running (like banks) don't realise AWS is not good enough.

      1. Roland6 Silver badge

        Re: This points to the fundamental issue with cloud computing and "centralization"

        >” After all, if your own datacenter has no redundancy, AWS can be a major step up even though it has this failure mode.”

        Only if you are prepared to rearchitect your systems to support datacentre redundancy and pay the AWS premium to run these services.

  13. Taliesinawen

    Distributed single point of failure

    AWS's control plane being centralized in US-EAST-1 creates a dependency. When US-EAST-1 experiences issues, it can cause global impacts because orchestration and control APIs become unavailable or degraded.
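
    One hedge against that control-plane coupling is pinning identity calls to a regional STS endpoint rather than the default global one, which has historically been served from us-east-1. A minimal sketch, assuming boto3 and working credentials; the region is only an example:

    ```python
    # Minimal sketch: pin identity calls to a regional STS endpoint so they do
    # not ride through the global endpoint (historically served from us-east-1).
    # Assumes boto3 and working credentials; the region is only an example.
    import boto3

    sts = boto3.client(
        "sts",
        region_name="eu-west-2",
        endpoint_url="https://sts.eu-west-2.amazonaws.com",
    )

    print(sts.get_caller_identity()["Account"])
    ```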

  14. Altrux

    TBTF

    AWS is now critical infrastructure for humanity. Too big to fail. And, oops, it failed. It only takes one small corner of AWS to create this much chaos, worldwide? We kid ourselves that we have conquered the resilience problem...

  15. Bluck Mutter

    When I was a lad...

    Or even an old geezer, I wrote code that relied on remote databases and in said environment you would have alternate service "pipes" available.

    So when a client process couldn't get a response to a request in a timely manner, it would try again and if that failed would try one of the alternate service "pipes".

    For mission critical stuff, you had an active-active topology which meant that some alternate service "pipes" would route to the primary database and some to the secondary database.

    And we are talking active-active systems with very large databases and 10,000's of active users.

    Underpinning this were also comms connections that were physically separate (i.e. the digger driver can only take out one link) and that used different comms vendors with radically different routing (in the physical sense).

    Of course, the above is an extreme (but necessary) topology for that app that is the life (and $$$$) blood of your company.

    My reading of this AWS issue is that some relatively noddy (in terms of scale) but global DynamoDB database couldn't be reached. If said database is so important, then why aren't there multiple synced copies in multiple regions/countries, with multiple pathways to those copies, and why isn't the software that accesses it able to detect a failed request and "self heal" by trying other pathways?

    Kids today and all that but I sit here shaking my head that shit I and countless others did years ago with critical apps/databases has been lost to the current generation.

    Bluck
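
    A minimal sketch of the alternate-pipes idea described above; the endpoint URLs are placeholders and the retry policy is deliberately simple:

    ```python
    # Minimal sketch of the "alternate pipes" idea: try independently-routed
    # endpoints in order with a short timeout before declaring failure.
    # The URLs are placeholders, not real services.
    import urllib.error
    import urllib.request

    ENDPOINTS = [
        "https://primary.example.com/api/health",
        "https://secondary.example.com/api/health",
        "https://tertiary.example.net/api/health",
    ]

    def fetch_first_available(endpoints, timeout=2.0):
        last_error = None
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()
            except (urllib.error.URLError, TimeoutError) as exc:
                last_error = exc  # this pipe is down; try the next one
        raise last_error

    print(fetch_first_available(ENDPOINTS))
    ```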

    1. Anonymous Coward
      Anonymous Coward

      Re: When I was a lad...

      The new generation of whizz-kids know better than the oldies/greybeards because they are whizz-kids.

      There is nothing to be learnt from the oldies because they are 'Old' & 'Stupid' !!!

      Old ideas are re-invented BUT better ... AKA old lessons need to be learnt again when they fail because your 'good idea' had failed in the past !!!

      Too much arrogance and not enough realisation that some things were done in the past for 'Good Reasons'.

      This 'generational cycle of doom' is too common and happens regular as clockwork ...

      :)

    2. Timop

      Re: When I was a lad...

      Being honest the kids today probably don't have the autonomy required for making any $$$$ calls within the projects.

      And even if they did, the companies probably laid off all the experienced people who could have passed their experience forward a long time ago.

      And even if they did not lay off everyone, the days and requirements (KPIs etc.) would probably be structured in a way that makes passing the information on extremely challenging.

      A snake eating its own tail, and we are the ones forced to watch and experience it.

  16. Anonymous Coward
    Anonymous Coward

    AWS

    Your one stop shop for decentralisation.

  17. AVR Silver badge

    In case you need another reason to avoid 'smart' devices

    A bunch of them got buggy when they lost access to their home server due to this outage. There were people complaining of their smart beds being locked on 'hot' or 'cold'.

    1. Timop

      Re: In case you need another reason to avoid 'smart' devices

      Yeah, it was pretty annoying to set things up local-network only. Browsing through tutorials. Pulling device keys and installing custom offline libraries etc.

      And manually setting the rules.

      When the manufacturer provides a smartphone app that works out of the box.

      Actually it took only something like 3-4 evenings in total, including setting up multiple devices and refining rules.

      But now the worst that could happen is that our underfloor heating is on a higher setting when the electricity price is high, if the connection has been down so long that the pre-fetched Nordpool prices have gone stale.

  18. Wellyboot Silver badge

    from Randall Munroe too many years ago...

    https://xkcd.com/908/

  19. Hcobb

    In order to overcome the https://en.wikipedia.org/wiki/CAP_theorem you need a computer that has the emotional maturity needed to operate without a false assumption of certainty.

  20. ShingleStreet

    The root of the problem was poor network design…

    …and whilst DNS is being talked about in this instance, ALL components of the service chain require a network design which provides adequate redundancy. There is no point in an active-active backend fronted by ANY frontend/controlplane component which does not have redundancy.

    So DNS is not the root cause here. A poor network design which does not match the end to end service-level requirements is the root cause.

    SS

  21. richardnpaul

    One thing that I think most people don't realise when using software that supports S3 is that it often defaults to the global S3 bucket-name endpoints, which are hosted by us-east-1. However, as I found out when debugging this, these endpoints can take up to 24hrs to resolve a bucket name in DNS. I was building an automated patch-testing CI pipeline with ephemeral instances and found it was failing because the DNS names for the buckets weren't resolving. The solution was to change the AWS endpoint to use the regional names and endpoint, which resolve near-instantly, but the defaults may have caught out a lot of people yesterday.
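
    A minimal sketch of that regional-endpoint fix, assuming boto3; the bucket name and region are placeholders:

    ```python
    # Minimal sketch of the regional-endpoint fix described above. Assumes boto3;
    # the bucket name and region are placeholders.
    import boto3

    # Pin the client to the bucket's own region and regional endpoint rather than
    # the global endpoint, whose DNS for new buckets can lag by up to a day.
    s3 = boto3.client(
        "s3",
        region_name="eu-west-2",
        endpoint_url="https://s3.eu-west-2.amazonaws.com",
    )

    s3.head_bucket(Bucket="my-ci-artifacts")  # raises ClientError if unreachable
    ```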

  22. joypar

    Centralisation in anything...

    ...always, always, always creates a single point of failure. Convenient, yes, undoubtedly; but fragile and an easy target for any enemy. In today's world that makes centralisation frequently a mistake.

    Consider data clouds.

    Consider power stations.

    Consider aircraft carriers.

  23. bboyes

    BSG: networks are an Achilles heel

    Every engineer and sysadmin needs to watch the Battlestar Galactica 2004 version. Plus there’s great Taiko drumming on the intro.
