A single DNS race condition brought Amazon's cloud empire to its knees

Amazon has published a detailed postmortem explaining how a critical fault in DynamoDB's DNS management system cascaded into a day-long outage that disrupted major websites and services across multiple brands – with damage estimates potentially reaching hundreds of billions of dollars. The incident began at 11:48 PM PDT on …

  1. DarkwavePunk Silver badge

    Ouch

    As someone with some levels of AWS certification, that's terrifying. Mostly because I understood most of that gobbledygook explanation. Pass the mind bleach.

    1. Like a badger Silver badge

      Re: Ouch

      Not until you explain to mere mortals what exactly a "DNS Planner" is, and why it's needed in simple terms. It'll help if you assume (accurately) I know nothing about hyperscale or cloud operations.

      1. DarkwavePunk Silver badge

        Re: Ouch

        I think it's part of their Route 53 service. I think it's to do with load balancing and DNS routing for latency. Read: "Voodoo".

        1. Like a badger Silver badge

          Re: Ouch

          Hmmm, I thought route 53 was the bus service from Wolverhampton to Bilston, via Wednesfield. So I'm afraid there's no mind bleach for you, sir!

          1. Paul Herber Silver badge

            Re: Ouch

            Could you explain then how buses reroute around the gas main roadworks near the Cleveland Arms without dynamic DNS? You can't allocate new drivers just like that!

            1. Like a badger Silver badge

              Re: Ouch

              "Could you explain then how buses reroute around the gas main roadworks near the Cleveland Arms without dynamic DNS? You can't allocate new drivers just like that!"

              Hickman Avenue, mate! And that's far better than the current route if you're going to the dogs at Monmore Green stadium; it obviously misses out the Cleveland Arms altogether, but since it's just got approval for conversion to a Toby Carvery it'll be well worth a miss.

              1. This post has been deleted by its author

          2. DarkwavePunk Silver badge

            Re: Ouch

            Could be worse I guess. I raise you the V3 to Burton via Willington.

            1. that one in the corner Silver badge

              Re: Ouch

              > I raise you

              Mornington Crescent!

              1. DarkwavePunk Silver badge

                Re: Ouch

                Cunt (in the nicest possible way). Well played.

          3. martinusher Silver badge

            Re: Ouch

            Bellevue to Trafford Park via Moss Side

          4. An_Old_Dog Silver badge

            Re: Ouch

            Your data has encountered a propagation delay.

            Route 53 is the Arctic-Allen run; it starts at Milikan Way and loops down around Allen Boulevard near Arctic Avenue.

            (Scary sentence: "Your system may have a race condition.")

          5. MyffyW Silver badge

            Re: Ouch

            Your mention of Wolvo and Bilston has just given me PTSD flashbacks to a datacentre outage from more than a decade ago. Thanks for that :-)

        2. Doctor Syntax Silver badge

          Re: Ouch

          Read: "Voodoo"

          In this case, Voodon't

        3. Anonymous Coward
          Anonymous Coward

          Re: Ouch

          “ The race condition occurred when one DNS Enactor experienced "unusually high delays" while the DNS Planner continued generating new plans. A second DNS Enactor began applying the newer plans and executed a clean-up process just as the first Enactor completed its delayed run. This clean-up deleted the older plan as stale, immediately removing all IP addresses for the regional endpoint and leaving the system in an inconsistent state that prevented further automated updates applied by any DNS Enactors.”

          I so want this to be Vibe coded, AI or July’s AWS job cuts cause.
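
          For anyone who wants to see the shape of the bug rather than the prose, here is a minimal, purely illustrative Python sketch of the interleaving described in the quote above. Every name in it (plans, endpoint, enactor_check and so on) is hypothetical; this is not AWS's code, just the classic "check passed long ago, apply late, the other worker's clean-up deletes what you applied" sequence.

          # Illustrative sketch (hypothetical names, not AWS code): the gap between
          # "is this plan newer than what's applied?" and "apply it" is the race.
          plans = {}                                # plan_id -> IPs the endpoint should serve
          endpoint = {"applied_plan": 0, "ips": []}

          def new_plan(plan_id, ips):
              plans[plan_id] = ips

          def enactor_check(plan_id):
              # time-of-check: only proceed if the plan is newer than what is applied
              return plan_id > endpoint["applied_plan"]

          def enactor_apply(plan_id):
              # time-of-use: may run long after the check passed
              endpoint["applied_plan"] = plan_id
              endpoint["ips"] = plans.get(plan_id, [])

          def cleanup(keep_latest):
              # delete "stale" plans and purge their records from the endpoint
              for pid in [p for p in plans if p < keep_latest]:
                  if endpoint["applied_plan"] == pid:
                      endpoint["ips"] = []          # the outage: endpoint left empty
                  del plans[pid]

          # The unlucky interleaving:
          new_plan(1, ["192.0.2.10"])
          slow_ok = enactor_check(1)                # slow enactor passes its check...
          new_plan(2, ["192.0.2.20"])               # ...while the planner keeps producing plans
          enactor_apply(2)                          # fast enactor applies the newest plan
          if slow_ok:
              enactor_apply(1)                      # the delayed run finally lands, regressing to plan 1
          cleanup(keep_latest=2)                    # fast enactor's clean-up deletes plan 1 as stale
          print(endpoint)                           # {'applied_plan': 1, 'ips': []}

          The end state mirrors the postmortem: the endpoint serves an empty record set and the plan it points at no longer exists, so no further automated update can reconcile it.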

          1. Anonymous Coward
            Anonymous Coward

            Re: Ouch

            "I so want this to be Vibe coded, AI or July’s AWS job cuts cause."

            Nah, probably just plain old stupidity.

            Ok, "stupidity" might be a little harsh. Reading about how all the failures cascaded makes me believe that the dependency graph of services has got to be a big ol bowl of spaghetti.

        4. mr-slappy

          Route 53

          Is that a replacement bus service by any chance? Maybe that's why it was so slow..

      2. toejam++

        Re: Ouch

        Route 53 is a dynamic DNS service. Instead of the usual static A record mappings of hostnames to IP addresses, it allows you to configure pools of IP addresses with various priority/failover and load-balancing schemes in play. The "DNS Planner" is a subsystem that reaches out and queries the health and status of the resources that use those IP addresses, either via an agent that runs on the cluster/box or by probing the object, so it can adjust what IP addresses are in play.

        For those of you familiar with F5 Global Traffic Manager (formerly 3DNS) and Local Traffic Manager (formerly BIG-IP), it appears to be a bit like the big3d daemon.

        I imagine that if that subsystem was hosed, it would result in the DNS service believing that local area resources (load-balancer VIPs, stand-alone servers, etc...) were unhealthy/offline (assumed or otherwise). That's really bad, especially if you don't have a static IP address of last resort configured for a hostname, because the DNS service will just stop offering IP addresses when something makes a DNS query.
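
        To put a concrete shape on that description, here is a hedged sketch of a Route 53 failover pair through the public API (boto3). The hosted zone ID, health check ID and addresses are made-up placeholders; the point is simply that answers are tied to health checks, with a static record of last resort, which is quite separate from whatever AWS's internal Planner/Enactor machinery does.

        # Sketch only: a failover pair in Route 53 via boto3. The zone ID, health
        # check ID and IP addresses below are placeholders, not real resources.
        import boto3

        r53 = boto3.client("route53")

        r53.change_resource_record_sets(
            HostedZoneId="Z0000000EXAMPLE",   # hypothetical hosted zone
            ChangeBatch={
                "Comment": "health-checked primary answer plus a static fallback",
                "Changes": [
                    {
                        "Action": "UPSERT",
                        "ResourceRecordSet": {
                            "Name": "api.example.com.",
                            "Type": "A",
                            "SetIdentifier": "primary",
                            "Failover": "PRIMARY",
                            "TTL": 60,
                            "ResourceRecords": [{"Value": "192.0.2.10"}],
                            "HealthCheckId": "11111111-2222-3333-4444-555555555555",
                        },
                    },
                    {
                        "Action": "UPSERT",
                        "ResourceRecordSet": {
                            "Name": "api.example.com.",
                            "Type": "A",
                            "SetIdentifier": "fallback",
                            "Failover": "SECONDARY",   # served when the primary's health check fails
                            "TTL": 60,
                            "ResourceRecords": [{"Value": "192.0.2.20"}],
                        },
                    },
                ],
            },
        )

        The SECONDARY record is the "static IP address of last resort" mentioned above: if the health-checked answer disappears, resolvers still get something.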

        1. This post has been deleted by its author

        2. werdsmith Silver badge

          Re: Ouch

          It must be part of the system that changes the DNS for your EC2 when you shut it down to save hours (and therefore money) unless you cough up to have a persistent name.

        3. Mike Pellatt

          Re: Ouch

          You were doing so well with that explanation until you used that awful phrase "reaches out"

          1. Mrs Spartacus

            Re: Ouch

            Ah yes, but at least he didn't circle back, be thankful for small mercies.

        4. JWLong Silver badge

          Re: Ouch

          Otherwise known as a clusterfuck (literally!).

        5. disk iops

          Re: Ouch

          > AWS has disabled the DynamoDB DNS Planner and DNS Enactor automation worldwide until safeguards can be put in place to prevent the race condition reoccurring.

          Which comes back to trying to get cute with DNS resolution. Clients have a NASTY tendency to hold onto records forever, which causes highly lumpy inbound access to services. What's hilarious is Amazon didn't copy what Akamai does with their load-balancers. The whole "let's be cute" business of screwing with the tables, updating the tables, and jiggering with weights can be reduced to something far simpler. And in the meantime your DNS records can be STATIC, and end-point alive/dead detection can be slow-moving and localized to the LB in question that owns the range.

    2. Stevie Silver badge

      Re: Ouch

      The new learning fascinates me Sir DarkwavePunk.

      Explain once more how we may employ sheep's bladders in the prevention of earthquakes.

  2. that one in the corner Silver badge

    Recovery wasn't rate limited?

    One thing missing in that description was any attempt to apply rate limiting to, well, any part of it.

    So a huge pile of machines basically all try to come up at once, without the staggering that limiting would cause (or inflict, depending upon p.o.v.), and start getting into a mess.

    Is this genuinely a surprise to anybody? Isn't everyone charged with engineering a system supposed to be asking "what happens if it all switches on at once?", no matter what the cause might be? From checking whether recovery from a power failure[1] means the hard drives[2] can be allowed to spin up all at once[3], to whether you can serve netboot images fast enough to prevent watchdog reboots, or even how many DNS "leases" you can serve out before you are swamped just handling renewals, because you KNOW you set the lease period way shorter than the DNS designers ever expected[4].

    [1] or Lady Florence pushing the Big Red Switch on Opening Day, not realising this one isn't a dummy

    [2] or the dynamos, each racing to be the master frequency the rest have to sync to

    [3] even in your home lab, can the circuit take that strain

    [4] I *think* I understand what was going on here, allowing machine identities to move around as hardware becomes available to handle requests for user operations (please, if anyone can correct that understanding, do so) but is that how people normally do load balancing? Not my area at all, but this really feels like a misuse of DNS.
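
    For what it's worth, the staggering being asked about here is usually nothing more exotic than random jitter plus a cap on concurrency in whatever brings hosts back. A minimal sketch, with hypothetical names and numbers rather than anyone's real recovery code:

    # Minimal sketch of rate-limited recovery: random jitter spreads the start-ups
    # out, and a semaphore caps how many hosts register at once. Names and numbers
    # are hypothetical.
    import random
    import threading
    import time

    MAX_CONCURRENT_STARTS = 10      # how many hosts may do their expensive start-up at once
    MAX_JITTER_SECONDS = 30         # spread a mass restart over this window

    start_gate = threading.Semaphore(MAX_CONCURRENT_STARTS)

    def register_with_control_plane(host: str) -> None:
        # stand-in for the expensive part: DNS updates, lease renewals, netboot, etc.
        print(f"{host} registered at {time.strftime('%X')}")

    def bring_up(host: str) -> None:
        time.sleep(random.uniform(0, MAX_JITTER_SECONDS))   # don't all arrive at once
        with start_gate:                                    # and don't all register at once
            register_with_control_plane(host)

    threads = [threading.Thread(target=bring_up, args=(f"host-{i}",)) for i in range(100)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()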

    1. Anonymous Coward
      Anonymous Coward

      Re: Recovery wasn't rate limited?

      really feels like a misuse of DNS.

      I also got the impression low TTL DNS records were being used to route traffic which I would think at this scale really is playing with fire.

      One of the advantages of IPv6 is that the address space is so large that each end point could be assigned a permanent address along with any number of ephemeral addresses, which could be assigned to processes, applications, etc. and follow them as they migrate around the cluster, hopefully pushing the routing out to the network and its dynamic routing processes.

      I suspect much of this stuff is a dark art simply because many of its practitioners have avoided enlightenment. ;)

      1. Anonymous Coward
        Anonymous Coward

        Re: Recovery wasn't rate limited?

        Even worse is low-TTL DHCP leases. I can understand 3h for customer instances, but for the instances that /are/ AWS, I would have expected them to be longer. Surely a system that bills for uptime would be able to purge DHCP entries for terminated instances.

      2. theblackhand

        Re: Recovery wasn't rate limited?

        I would suggest you are looking at the problem from the wrong direction. The issue isn't existing DNS mappings. They work.

        It's new mappings. You have to be able to create/delete records to flex services up/down/between data centres (US-EAST-1 is a collection of around 100 large data centres), and for each new instance that is required to cope with increased load or the migration of load between your capacity groupings (a data centre hall is likely the smallest grouping).

        Once your DNS move/add/delete process is delayed, demand will create a situation where key services reach capacity and then you enter the downward spiral of no capacity to cope with current load and no ability to increase capacity.

        This ignores any systems used to avoid this situation (DNS Planner and DNS Enactor). My assumption is that something such as maintenance or power outages triggered the DNS issue by causing a loss of data centre capacity, which drove some of the initial demand issues; historically, that has been the cause of a large number of previous US-EAST-1 outages.

        It's worth noting that a number of AWS people have said that US-EAST-1 is too big to be stable BUT customers want it, and it provides valuable data on how to run other AWS regions reliably, as they have been built to avoid the extreme scale issues US-EAST-1 has. Ref: https://www.theregister.com/2024/04/10/aws_dave_brown_ec2_futures/

  3. Steve Graham

    is that how people normally do load balancing?

    In my career (of distant memory) we'd have a layer of boxes behind a single IP address that distributed "work" one layer down to the servers.

    1. Nate Amsden Silver badge

      Re: is that how people normally do load balancing?

      Depending on the situation, yes. Most often DNS is used for load balancing across multiple sites, though more commonly such load balancing is used for geo traffic distribution. Internal load balancing with DNS only is relatively rare, but some things do leverage it. Amazon has a history of layering DNS on top of DNS to mitigate(?) issues with them wanting to rotate IP addresses on some things (most commonly an issue with ELB; ironically, you can find old articles where customers got flooded with traffic for other customers due to DNS cache issues after AWS changed their ELB IPs. RDS too, I think). Fortunately I haven't had to deal with AWS myself in over a decade, so stress levels are much lower since I run my own stuff, which is super stable.

  4. Anonymous Coward
    Anonymous Coward

    I am vaguely reminded of an incident from my past

    PHB decided that machines left on overnight is bad.

    Everyone switches machine off, goes home.

    Gets to work. Powers up. Machine needs to download and install an update.

    3,000 machines and nearly 24 hours later, when no one has done a day's work, it's IT's fault.

    1. Anonymous Coward
      Anonymous Coward

      Re: I am vaguely reminded of an incident from my past

      That's the funniest thing I've read in a while, superb!

      You have to hand it to the ignorant PHBs who will gladly shout the techies down and implement some diktat (or we get fired); we do as commanded and still get a bollocking 'cos the PHB is desperately trying to keep the C-suite from chewing his arse to pieces over whatever his latest cock-up was!

      1. James 139

        Re: I am vaguely reminded of an incident from my past

        And the worst of them always manage to keep it verbal.

        Always get it in writing!

    2. Tron Silver badge

      Re: I am vaguely reminded of an incident from my past

      Machine is forced to download and install an update.

      FTFY.

      1. Doctor Syntax Silver badge

        Re: I am vaguely reminded of an incident from my past

        Nailed it.

        Switching off unused machines, good (OK, I'm old enough to think saving energy is a Good Idea).

        Having uncontrollable forced download & update bad. Whoever thought of such nonsense?

        1. Claptrap314 Silver badge
          Linux

          Re: I am vaguely reminded of an incident from my past

          It wasn't Linus, that's for sure...

    3. Boris the Cockroach Silver badge
      Pint

      Re: I am vaguely reminded of an incident from my past

      Been there, done that, been shouted at

      More beer to wipe away the memories of being yelled at to get the job done ASAP while the PC sat there doing spinny wheely things.......

      1. werdsmith Silver badge

        Re: I am vaguely reminded of an incident from my past

        It is IT’s fault.

        They are not using update management tools.

        The timing of when updates are applied to desktop machines can be 100% controlled by your own IT function.

    4. Anonymous Coward
      Anonymous Coward

      Re: I am vaguely reminded of an incident from my past

      The day after patch Tuesday with everyone turning on their laptops, along with a Europe-wide VPN and a single corporate-wide web proxy is just as bad, believe me.

      Yes, the design didn't scale up to what was being asked of it.

      1. Anonymous Coward
        Anonymous Coward

        Re: I am vaguely reminded of an incident from my past

        > The day after patch Tuesday with everyone turning on their laptops, along with a Europe-wide VPN and a single corporate-wide web proxy is just as bad, believe me.

        Try a 2-week Christmas shutdown, everyone turning their machines back on on the first working day in January, and the AV systems deciding that it had been too long for a delta and instead applying a full definitions update.

        Burning WAN links and network guys screaming at us for using their precious bandwidth....

        And yes, we did put mitigations/throttling in place after the first time!

  5. Anonymous Coward
    Anonymous Coward

    bed is stuck

    I wanna know when my fucking bed will start working again!

    1. Gerhard den Hollander

      Re: bed is stuck

      You have a special bed for fornicating ?

      Maybe you can use your sleeping bed for the time being ? Or the couch ...

      1. Anonymous Coward
        Anonymous Coward

        Re: bed is stuck

        https://www.msn.com/en-us/news/other/this-weeks-aws-crash-made-smart-beds-overheat-get-stuck-in-wrong-position/ar-AA1P0rN1

        1. Doctor Syntax Silver badge

          Re: bed is stuck

          Whoosh

        2. steelpillow Silver badge

          Re: bed is stuck

          Schadenfreude

        3. mr-slappy

          Re: bed is stuck

          I, for one, welcome our robot bed overlords

      2. Like a badger Silver badge

        Re: bed is stuck

        You have a special bed for fornicating ?

        Maybe you can use your sleeping bed for the time being ? Or the couch ...

        Or a sock.

        1. Korev Silver badge
          Coat

          Re: bed is stuck

          Socks4 or Socks5?

          1. MatthewSt Silver badge

            Re: bed is stuck

            Just steer clear of Gopher

            1. AtomicWombat

              Re: bed is stuck

              "Just steer clear of Gopher"

              Don't worry, pretty much everyone not being paid by the University of Minnesota to work on it already did.

    2. Anonymous Coward
      Anonymous Coward

      Re: bed is stuck

      I have to question what state humanity is in when we have to have a f**king bed connected to the world's biggest network 24/7 or it won't work!

      1. Doctor Syntax Silver badge

        Re: bed is stuck

        What state is humanity in when a bed has to "work" at all. Existence should be sufficient.

        1. werdsmith Silver badge

          Re: bed is stuck

          A world that contains a variety of people, some with conditions that make a standard bed difficult to use.

          1. Tim Kemp

            Re: bed is stuck

            Certainly here in the UK one does not need to have an internet connection to have an adjustable or assistive bed...

            It's crazy how much shit depends on cloud connectivity.

            1. werdsmith Silver badge

              Re: bed is stuck

              The beds actually do have the capability to work off a local control handset. But cloud is the cheapest and easiest way to enable voice control, by simply using Alexa or similar.

              If you need one of these special beds then there's a chance that voice control will give you independence.

              Some like to overlook that the bed doesn't actually depend on cloud in order to trigger the gullible. Maybe think it through first before filling another pair of Tena pants.

      2. Anonymous Coward
        Anonymous Coward

        Re: bed is stuck

        They told me it was like sleeping in the clouds!

      3. ajadedcynicaloldfart

        Re: bed is stuck

        @Curmudgeon in Training

        You didn't mention the subscription that goes with it...

        What a world we live in when you have to pay a subscription fee to keep your fucking bloodybollockybuggery bed working as intended.

        https://help.eightsleep.com/en_us/can-i-cancel-autopilot-Bkbr7s9rn

        1. Anonymous Coward
          Anonymous Coward

          Re: bed is stuck

          > https://help.eightsleep.com/en_us/can-i-cancel-autopilot-Bkbr7s9rn

          Autopilot, on a f*cking bed? What could go wrong.

          Stop the world, I want to get off now.

          1. Anonymous Coward
            Anonymous Coward

            Re: bed is stuck

            > Autopilot, on a f*cking bed? What could go wrong

            "Not tonight dear, I have a headache. Why don't you engage the autopilot."

        2. isdnip

          Re: bed is stuck

          And the eightsleep web site doesn't actually quote prices, though it allows you to ask for sales to let you know, which itself should be a red flag.

  6. bsdnazz

    Too big to turn on?

    So when DNS was fixed, there were so many services trying to restart that many of them failed.

  7. breakfast Silver badge

    Looks like they still didn't catch the cause

    The report explains what the race condition was, but not why the Enactor was running so slowly in the first place, which was technically the cause of the problem. I wonder whether they know - a challenge of work at this scale is that you can have problems that only happen with production workloads and it's hard to reproduce that and properly isolate the cause.

    1. steelpillow Silver badge
      Facepalm

      Re: Looks like they still didn't catch the cause

      A cascade of FFS! causes:

      1. The enactor should not have been running slow.

      2. Nothing to monitor its status.

      3. The various boxen had no idea what to do about it before jumping in feet-first, cuz nobody had thought through Murphy's Law.

      4. Someone who knows what they are doing should have been retained with sweeteners, not driven out by bullying manglement.

      I expect there are several more.

      1. tfewster Silver badge
        Facepalm

        Re: Looks like they still didn't catch the cause

        Rule 0 for an Enactor (DNS or any other type of Enactor): Check if anything has the system "locked" before making changes.

        1. O'Reg Inalsin Silver badge

          Re: Looks like they still didn't catch the cause

          Async computing 101, atomic modifications. The same category of whoopsie that triggered the British Post Office Limited catastrophe.

        2. richardcox13

          Re: Looks like they still didn't catch the cause

          Distributed systems at sufficient scale need to handle the Byzantine Generals problem.

          Which is not easy. And while the design may be good, ensuring that the implementation still achieves its robustness over years (and decades) of maintenance is even less easy.

          1. David Hicklin Silver badge

            Re: Looks like they still didn't catch the cause

            > And while the design may be good, ensuring implementation over years (and decades) of maintenance still achieves its robustness is even less easy.

            Not to mention "minor" changes to the code which, as pointed out elsewhere, are impossible to test at scale. Until now, that is, and the test failed.

            Now just imagine some weird scenario that takes out ALL of AWS or Azure, and how long it would take to recover.

            Business Continuity Plans, anyone? I think the only one for the cloud is "pray".

          2. breakfast Silver badge

            Re: Looks like they still didn't catch the cause

            I was not familiar with Byzantine Faults, this has introduced me to a whole exciting new realm of problems that can easily arise in async systems...

      2. Anonymous Coward
        Anonymous Coward

        Re: Looks like they still didn't catch the cause

        4. Probably booted out of the door in July’s AI driven AWS job cuts.

      3. David Hicklin Silver badge

        Re: Looks like they still didn't catch the cause

        > 2. Nothing to monitor its status.

        And each monitor adds its own load and race condition - where do you draw the line? What monitors the monitors??

        1. Rob Daglish

          Re: Looks like they still didn't catch the cause

          I believe Blackboard Monitor Vimes...

    2. Doctor Syntax Silver badge

      Re: Looks like they still didn't catch the cause

      If it's a race condition, "slow" is a relative term. It just means slower than the Planner could handle.

      1. breakfast Silver badge
        Thumb Up

        Re: Looks like they still didn't catch the cause

        Sure, but given that this hasn't happened before there must be a threshold there and there must be a reason that line was crossed this week and not before.

    3. Claptrap314 Silver badge

      Re: Looks like they still didn't catch the cause

      That is very, VERY rarely the case. Yes, I started in microprocessor validation, but there are still ways to slow down a box. Take such a box, put it on an isolated network, and hook it up to a large number of not-slow boxes & let them go to town.

      I get that most people would not have been exposed to such solutions. The FAANG engineers absolutely. Wait. I guess those are agents now....

  8. pfalcon

    Asynchronous Programming is HARD

    As soon as you have more than one or two operations happening at the same time, being (re)triggered (by external sources), or just operating at scale, then dealing with all of the permutations and combinations is rather complex, to say the least. Even supposedly simple systems can be mind-boggling sometimes.

    At this point earlier articles about a brain drain in AWS make a lot of sense. Experienced developers/network admins would understand these complexities, and probably not roll out "quick changes" to anything without a thorough review (including in-house experience about parts of the system that are more prone to issues...etc). Juniors are more likely to feel the pressure to deliver (and kowtow to a PHB), and roll something out without understanding the implications. Worse, if someone was cutting and pasting simplistic code generated from AI...

    1. retiredFool

      Re: Asynchronous Programming is HARD

      I can just imagine the junior guy consulting with the AI frantically asking what do I do, what do I do? The AI helpfully responds with, "try turning the power off and on to reboot everything". I mean that is the uSoft script I think. And the AI scrapes the most probable answer!

      1. Sudosu Silver badge

        Re: Asynchronous Programming is HARD

        "The override. Where's the override?"

    2. Doctor Syntax Silver badge

      Re: Asynchronous Programming is HARD

      "Juniors are more likely to feel the pressure to deliver (and kowtow to a PHB)"

      I've often said paranoia is the first requirement of a database manager. I'll add "ability to terrify PHBs" to that.

    3. yoganmahew

      Re: Asynchronous Programming is HARD

      @pfalcon

      Came here to say that about asynchronous!

      Every time you have two processes running with the same target, it is inevitable that they will at some stage process out of order; not just a chance, inevitable.

      If you don't have a way to know that is happening, or a way to deal with what happens when it does, you should stick to synchronous locks.

  9. Csmy

    I can't see any explanation as to how this knocked out services nowhere near us-east-1.

    1. Claptrap314 Silver badge

      That was mentioned earlier. AWS global services are homed in us-east-1. So... IAM. And the global region for anything.

      1. Excused Boots Silver badge

        Fair enough, so this ‘embrace the cloud because of redundancy and no single point of failure’ is possibly not as true as might be thought. No?

        1. David 132 Silver badge

          With apologies and due credit to whoever wrote this a few days ago: "The Cloud is just distributed Single Points of Failure".

        2. Claptrap314 Silver badge

          Certainly not for AWS. Remember, they started out selling excess capacity after the Christmas rush. Their resilience has mostly been reactive & after-the-fact. I learned SRE at Google. Resilience there was critical from the get-go, so I expect that it should be easier to achieve HA there, but I've not actually worked that side yet to have any idea what it's like for GCP.

  10. Nate Amsden Silver badge

    So maybe that's why

    Some folks have DNS outages, automation messing things up.

    All critical DNS entries on my systems have always been manually managed, and the majority of them can go for many years without needing any changes. IPs are statically assigned across the board, and lifetimes of systems are again measured in years. Slower change = fewer times things can go wrong. There is some automation so that, for folks who wish to create new systems, DNS entries can be created automatically (I don't leverage this myself but others use it), but there has never been anything that modifies existing DNS entries (worst case you get a duplicate DNS entry if DNS/IPAM doesn't agree when building a new system, but that only impacts that one new system).

    Some people strive to automate everything, and with cloud the automation needs are quite a bit higher as there is far more complexity. I prefer simpler systems and generally have less automation. My mantra is: if your automation saves a bunch of time in the long run, then fine, it could be good to do it. But if creating, maintaining and testing that automation consumes as much time as doing the work manually (or even close to as much time), then don't bother.

    1. Claptrap314 Silver badge

      Re: So maybe that's why

      I feel like we have clashed a bit over this in the past. What I believe you have failed to note is the scale of AWS. My semi-educated guess is that they have between 10 million and 100 million servers. At that scale, "doing it manually" really must be avoided like the plague, not just because of the cost involved, but because meat generally sucks at getting things perfect out of the gate. (Just check my posts for typos, for instance.)

      It appears that you are running a tight ship, which is great. But it is a tiny, tiny ship compared to what AWS has.

      1. Excused Boots Silver badge

        Re: So maybe that's why

        Ok yes, but your point is.....?

        Ok fine, best will in the world, 'shit happens’. But if it happens on-prem, it’s a bad day for you and your company, fine, damage is fairly limited.

        Same thing happens to AWS, or Azure etc, and.......

        1. Claptrap314 Silver badge

          Re: So maybe that's why

          My point is that even for a mid-sized shop, the rule is "automate everything". For AWS it should be more of an iron-clad requirement. Nate's experience is simply not applicable to the matter at hand.

        2. Joseba4242

          Re: So maybe that's why

          Actually, if you have a problem on prem, in the eyes of the customers and, if it's big, the press, it's YourCompany that's the problem.

          If the problem is AWS, then it's AWS that's the problem.

          That's a huge reputational difference.

          And please don't tell me that sufficiently complex on-prem systems don't have similar length outages. We just don't discuss them here.

      2. A Non e-mouse Silver badge
        Headmaster

        Re: So maybe that's why

        In $DAY_JOB I frequently run up against smart arses who say "Why do you do it like that? I do this, it works and is super easy."

        What those idiots forget is they are looking after a couple of VMs/Servers/Switches/etc. Come back to me with your "ideas" once you've got experience of running thousands of any of them. (And the company and all its employees depend on them all working)

  11. steviebuk Silver badge

    But

    the cloud is amazing, the cloud is resilient, the cloud never goes down as you just switch to another zone/region which isn't MASSIVELY expensive.

    It's all wank.

  12. ecofeco Silver badge
    Facepalm

    How the hell?

    Never mind.

  13. hx

    If I read that correctly, AWS engineers do not understand what they implemented

    At least, not until it fell over. Now I need to re-evaluate the assurances that it just isn't possible for someone at Amazon to nuke all data with a single errant change.

    1. A Non e-mouse Silver badge

      Re: If I read that correctly, AWS engineers do not understand what they implemented

      Anything at Amazon/Google/Microsoft/etc scale is going to be complex. You will never fully understand it.

  14. amacater

    Someone else's computer, someone else's rules - YOUR data, YOUR business

    This is instructive: run, don't walk, away from the major cloud providers. It's still cheaper, quicker and less dangerous to do this yourself if you need five nines, and to do it on prem if that's contained and what's needed.

    1. Anonymous Coward
      Anonymous Coward

      Re: Someone else's computer, someone else's rules - YOUR data, YOUR business

      Most businesses _really_ don't need five 9s though. Four, maybe. Three, definitely.

      If your business can't cope with one of your IT systems having an unexpected lie down for 6 minutes over the course of a year, you're either doing something incredibly niche and safety-critical, or you're over-reliant on that system.

      1. richardcox13

        Re: Someone else's computer, someone else's rules - YOUR data, YOUR business

        or you're over-reliant on that system.

        or you've built processes that are so lean that they are just fragile.

      2. David Hicklin Silver badge

        Re: Someone else's computer, someone else's rules - YOUR data, YOUR business

        > Most businesses _really_ don't need five 9s though. Four, maybe. Three, definitely.

        They need it whilst production is running

        The IT systems can have a lie down at other times; we just need a way of scheduling the failures.

      3. Claptrap314 Silver badge

        Re: Someone else's computer, someone else's rules - YOUR data, YOUR business

        I agree that the bulk of businesses don't need five nines. Most that do will also be on-prem for their business-critical requirements (manufacturing facilities, hospitals). I also KNOW that emergency dispatch in many locations was using Hangouts when I was an SRE. Mental health hotlines will be very similar. I expect that stock borkers (intended) also really need 5+ nines.

        Anyone doing retail sales REALLY doesn't want to be down for even fifteen minutes on Cyber Monday; you can make a case for 5 nines if you are in retail.

        But yes, if your business can be down for a full day with no serious consequences, you only need three nines, and your cloud spend will be expensive. But less expensive than a dedicated sysadmin if you are a small business/startup. My last company, with roughly 100 people, had an AWS + Heroku bill of $5000/mo when I started. We were a medical services middle-man, so we really did need better than four nines as well.

  15. Anonymous Coward
    Anonymous Coward

    Thank you commentards

    Out of the tragedy of AWS and things going global titsup come many comments of great humour from my fellow commentards to lighten the day.

    Thank you one and all.

  16. Fruit and Nutcase Silver badge
    Joke

    DynamoDB?

    Amazon should have gone for the more efficient and reliable AlternatorDB

    1. David Hicklin Silver badge

      Re: DynamoDB?

      One that works in two directions ?

  17. Andy3

    Thank goodness our proposed bright, shiny and new 'digital ID' system will be completely immune from any similar effect, eh? Thank the Lord that the system will NEVER go down and leave millions of citizens stranded without access to their money, ID, proof of driving licence and insurance, disappeared hospital appointments and probably no way to fuel their car as cash pumps will have been banned by The Ed Milli Band.

  18. Clanker v1.01

    In Techie-Speak "An unrecoverable race condition"

    In pictures "A Slinky on an Escalator, falling over perpetually"

  19. Frank Leonhardt
    Flame

    Amazon took a robust system and added failure modes

    DNS is distributed - always was. BIND allows delegation to multiple autonomous redundant DNS servers which are distributed around the place so they won't all go at once. At one time (still?) the German registry insisted a domain's glue records had to be on separate subnets for redundancy. They thought it out. Only if you lost ALL the root servers could you knock the system out, and there are loads, located on different continents. The load was balanced by virtue of multiple DNS servers dealing with a small number of requests, managed by the domain administrators. Any failure would be isolated to the domains in question. The Unix peeps at Berkeley knew what they were doing.

    Then some idiot thought "Let's set up a single point of failure and get lots of people to outsource their DNS to it, and let's manage it using a complicated database system and stuff it in our cloud so we'll get lots of money. Never give a sucker an even break."

    So do you blame Amazon, or the management fools that rushed to the "cloud" because they fell for the marketing hype?

  20. JohnnyS777

    Not mine. But is it ever relevant...

    It's not DNS

    There's no way it's DNS

    It was DNS

  21. naive

    Maybe this issue reveals some design flaws

    It may be efficient to design the cloud like a pyramid on its head, but a more confederated solution would be more robust.

    Perhaps cloud giants like MSFT, GOOGL and AMZN should have a look at their cloud designs and divide their empires into independent regions. It is not good when a single bug or typo reverts time to the 1950s.

    1. CoyoteDen

      Re: Maybe this issue reveals some design flaws

      They are divided into regions. US-EAST-1 is the only AWS region that went down, but there is a lot of stuff in US-EAST-1.

  22. Elongated Muskrat Silver badge

    So, in summary,

    The people running AWS have never heard of a mutex?

    1. CoyoteDen

      Re: So, in summary,

      Uh huh. I've seen similar things happen on a much smaller scale where I work. After cleaning up the mess I found the guy responsible on Teams and sent him:

      "it's called a lockfile, USE ONE."

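      For the record, a lockfile in this sense is only a handful of lines. A minimal sketch, assuming POSIX and a made-up lock path:

      # Sketch: mutual exclusion via an advisory lockfile (POSIX flock).
      # The path is made up for illustration.
      import fcntl

      def apply_changes():
          with open("/var/lock/dns-plan.lock", "w") as lockfile:
              fcntl.flock(lockfile, fcntl.LOCK_EX)   # blocks until nobody else holds the lock
              try:
                  ...                                # make the changes here
              finally:
                  fcntl.flock(lockfile, fcntl.LOCK_UN)
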
    2. disk iops

      Re: So, in summary,

      the "skill level" at Amazon is highly uneven. DynamoDB, being one of the original services, used to be run by top-shelf talent. Similarly S3. But even 10+ years ago the brain drain was in full swing, and it was left to inexperienced replacements (largely H-1B hires, natch) desperately trying to understand how the system worked and how to keep all the plates in the air and spinning. They had never run a complex system, let alone seen one, and were entirely inadequate to the task. (This was when Jassy was head of AWS, several years before his promotion to CEO.)

      The S3 service, for example, relies on the notion of eventual consistency for customer data, so it has large leeway in how things are presented to the user. DynamoDB follows similar "eventual consistency" but on much shorter timescales. People FORGET or IGNORE these realities of the two services and treat them as ATOMIC with immediate READ-AFTER-WRITE consistency. From there all kinds of hilarity ensues.

      It took 3 days for the S3 ecosystem (back then a mere 300,000 nodes) to coalesce and assemble a "system image" as distinct pods or spheres gradually merged and updated their maps of "who is my neighbor". DynamoDB works similarly. BUT their code-base may be VERY outdated. It is common to find teams in AWS using a dozen different versions of the main tooling because they can't be arsed to stay current or even moderately so. So the race condition between the two update/execute paths *may* have had at its root divergent code-bases, and thus divergent behaviours.

      IMO the constant fudging with DNS is an outgrowth of not understanding that the system was never intended to be contorted into how AWS uses it, and I suspect they could have used SLOW (relatively speaking) state-map updates to keep the map churn at a reasonable level. People who use AWS, and frankly internally too, don't seem to understand that INVALID DATA is FINE. Wait a bit and try again, or try the alternate answers. But no, they keep trying to achieve "perfection" in accuracy and you get these bolted-on "improvements" which wreak havoc when some of the fundamental assumptions prove to be incorrect.

  23. CoyoteDen

    TOCTOU strikes again.

    From the postmortem:

    "The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing."

    time-of-check, time-of-use bug.

    You can fix this in one of two ways: Either check immediately before every change to make sure something hasn't updated it behind your back, or put a lock on it at the start so nothing can.
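
    In code, "check immediately before every change" usually means making the check and the write a single atomic step, i.e. comparing the version inside the same lock (or conditional operation) that performs the update, rather than look-then-leap. A generic sketch with hypothetical names:

    # Sketch: close the TOCTOU gap by re-checking the version inside the same
    # critical section that performs the write. Names are hypothetical.
    import threading

    class PlanStore:
        def __init__(self):
            self._lock = threading.Lock()
            self.applied_version = 0
            self.records = []

        def apply(self, version, records):
            """Apply a plan only if it is still newer than what has been applied."""
            with self._lock:
                if version <= self.applied_version:   # the check...
                    return False                      # ...and the act happen under one lock
                self.applied_version = version
                self.records = records
                return True

    store = PlanStore()
    assert store.apply(2, ["192.0.2.20"]) is True     # the fast enactor wins
    assert store.apply(1, ["192.0.2.10"]) is False    # the delayed, stale plan is rejected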

    1. disk iops

      Re: TOCTOU strikes again.

      DynamoDB is *eventually* consistent. Your solution only works if the backend is ACID, which DDB is not. That said, there are still solutions available.
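
      True for the default read path, although DynamoDB does offer conditional writes (and strongly consistent reads), which is one of those available solutions: the condition is evaluated server-side, atomically with the write, so a stale writer simply fails. A sketch with a made-up table and attribute names:

      # Sketch: a conditional write in DynamoDB via boto3. The table and attribute
      # names are made up. The condition is evaluated atomically on the server, so
      # an out-of-date writer simply loses.
      import boto3
      from botocore.exceptions import ClientError

      table = boto3.resource("dynamodb").Table("dns_plan_state")   # hypothetical table

      def apply_plan(endpoint, version, records):
          try:
              table.put_item(
                  Item={"endpoint": endpoint, "version": version, "records": records},
                  # only succeed if no plan exists yet, or ours is strictly newer
                  ConditionExpression="attribute_not_exists(version) OR version < :v",
                  ExpressionAttributeValues={":v": version},
              )
              return True
          except ClientError as err:
              if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                  return False   # a newer plan is already in place; do nothing
              raise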

  24. CoyoteDen

    But the real failure mode here is...

    THE ACTIVE CONFIG COULD BE DELETED.

    That should never happen.

    The right way to do it would be to copy the active config, insert and delete on the copy, then, once nobody is holding a lock on it, sanity check it. If there is nothing obviously wrong (like it being completely empty!) you switch to it.

    You also keep the previous config so you can quickly fail over if things crash.
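
    That copy / validate / swap / keep-the-previous-one dance needs nothing fancier than an atomic rename. An illustrative sketch only; the paths and the sanity check are made up:

    # Sketch: never edit the live config in place. Build a candidate, sanity-check
    # it, swap it in atomically, and keep the previous version for fast rollback.
    # Paths and the sanity check are illustrative only.
    import json
    import os
    import shutil

    ACTIVE = "config/active.json"
    PREVIOUS = "config/previous.json"

    def publish(candidate):
        if not candidate.get("records"):            # never publish an empty config
            raise ValueError("refusing to publish an empty config")

        tmp = ACTIVE + ".tmp"
        with open(tmp, "w") as fh:
            json.dump(candidate, fh)
            fh.flush()
            os.fsync(fh.fileno())

        if os.path.exists(ACTIVE):
            shutil.copy2(ACTIVE, PREVIOUS)          # keep the last known-good copy
        os.replace(tmp, ACTIVE)                     # atomic swap on the same filesystem

    def rollback():
        os.replace(PREVIOUS, ACTIVE)                # fall back to the previous config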

  25. le_gazman

    Are the words "race" and "rare" being mixed up here?

  26. martinusher Silver badge

    Nice they figured it out

    ...and they told everyone. I daresay a permanent fix will follow. But.....

    We're being asked to buy stuff that's centered around an always-on Internet connection that doesn't really need it; it's just a gimmick to collect data for sale to information brokers. I find it profoundly annoying that, for example, my house thermostats have to converse through a series of remote servers for me to be able to access them remotely (that is, from just across the ***!!** room). This is inherently bad, unstable design; you'd never design an industrial plant like this, so why inflict it on people at home? (Greed..... of course.....)
