Ouch
As someone with some levels of AWS certification, that's terrifying. Mostly because I understood most of that gobbledygook explanation. Pass the mind bleach.
Amazon has published a detailed postmortem explaining how a critical fault in DynamoDB's DNS management system cascaded into a day-long outage that disrupted major websites and services across multiple brands – with damage estimates potentially reaching hundreds of billions of dollars. The incident began at 11:48 PM PDT on …
"Could you explain then how buses reroute around the gas main roadworks near the Cleveland Arms without dynamic DNS? You can't allocate new drivers just like that!"
Hickman Avenue, mate! And that's far better than the current route if you're going to the dogs at Monmore Green stadium. Obviously it misses out the Cleveland Arms altogether, but since that's just got approval for conversion to a Toby Carvery it'll be well worth a miss.
“ The race condition occurred when one DNS Enactor experienced "unusually high delays" while the DNS Planner continued generating new plans. A second DNS Enactor began applying the newer plans and executed a clean-up process just as the first Enactor completed its delayed run. This clean-up deleted the older plan as stale, immediately removing all IP addresses for the regional endpoint and leaving the system in an inconsistent state that prevented further automated updates applied by any DNS Enactors.”
I so want the cause of this to be vibe coding, AI, or July's AWS job cuts.
"I so want this to be Vibe coded, AI or July’s AWS job cuts cause."
Nah, probably just plain old stupidity.
Ok, "stupidity" might be a little harsh. Reading about how all the failures cascaded makes me believe that the dependency graph of services has got to be a big ol bowl of spaghetti.
Route 53 is a dynamic DNS service. Instead of the usual static A record mappings of hostnames to IP addresses, it allows you to configure pools of IP addresses with various priority/failover and load-balancing schemes in play. The "DNS Planner" is a subsystem that reaches out and queries the health and status of the resources that use those IP addresses, either via an agent that runs on the cluster/box or by probing the object, so it can adjust what IP addresses are in play.
For those of you familiar with F5 Global Traffic Manager (formerly 3DNS) and Local Traffic Manager (formerly BIG-IP), it appears to be a bit like the big3d daemon.
I imagine that if that subsystem was hosed, it would result in the DNS service believing that local area resources (load-balancer VIPs, stand-alone servers, etc...) were unhealthy/offline (assumed or otherwise). That's really bad, especially if you don't have a static IP address of last resort configured for a hostname, because the DNS service will just stop offering IP addresses when something makes a DNS query.
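For anyone who wants the "static IP address of last resort" point in concrete form, here's a toy sketch (all names invented; this is not Route 53's actual logic): only endpoints that pass a probe are handed out, and if nothing is healthy the resolver falls back to a static record, if one is configured, rather than answering with nothing.

```python
# Toy sketch only: "serve healthy IPs, else fall back to a static record of
# last resort". Not Route 53 internals; all names and checks are invented.
import socket

def probe(ip, port=443, timeout=1.0):
    """Crude health check: can we open a TCP connection to ip:port?"""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def answer(pool, static_fallback=None):
    """IPs a DNS query should get: healthy pool members, else the fallback."""
    healthy = [ip for ip in pool if probe(ip)]
    if healthy:
        return healthy
    if static_fallback:
        return [static_fallback]       # the static IP of last resort
    return []                          # nothing configured: the query gets no answer

# e.g. answer(["203.0.113.10", "203.0.113.11"], static_fallback="203.0.113.99")
```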
> AWS has disabled the DynamoDB DNS Planner and DNS Enactor automation worldwide until safeguards can be put in place to prevent the race condition reoccurring.
Which comes back to trying to get cute with DNS resolution. Clients have a NASTY tendency to hold onto records forever, which causes highly lumpy inbound access to services. What's hilarious is that Amazon didn't copy what Akamai does with their load-balancers. The whole "let's be cute" business of screwing with the tables, updating the tables and jiggering with weights can be reduced to something far simpler. And in the meantime your DNS records can be STATIC, and end-point alive/dead detection can be slow-moving and localized to the LB in question that owns the range.
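Purely as an illustration of that split (invented names, not Akamai's or anyone else's actual code): the address published in DNS never changes, while liveness is tracked only inside the LB that owns it.

```python
# Sketch only: static DNS -> one fixed VIP per load balancer; the LB keeps
# its own slow-moving alive/dead view of backends. All names are invented.
import itertools

class LocalBalancer:
    def __init__(self, vip, backends):
        self.vip = vip                       # published in a *static* A record
        self.alive = {b: True for b in backends}
        self._rr = itertools.cycle(backends)

    def mark(self, backend, alive):
        # Local health detection - no DNS record ever needs to change.
        self.alive[backend] = alive

    def pick(self):
        # Round-robin over backends, skipping ones marked dead.
        for _ in range(len(self.alive)):
            b = next(self._rr)
            if self.alive[b]:
                return b
        return None                          # all backends down; the VIP still resolves

lb = LocalBalancer("198.51.100.1", ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
lb.mark("10.0.0.2", False)
print(lb.pick())                             # serves from whatever is still alive
```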
One thing missing in that description was any attempt to apply rate limiting to, well, any part of it.
So a huge pile of machines basically all try to come up at once, without the staggering that limiting would cause (or inflict, depending upon p.o.v.), and start getting into a mess. (A rough sketch of what that staggering could look like follows the footnotes below.)
Is this genuinely a surprise to anybody? Isn't everyone charged with engineering a system supposed to be asking "what happens if it all switches on at once?" no matter what the cause might be? From checking whether recovery from a power failure[1] means the hard drives[2] can be allowed to spin up all at once[3], to whether you can serve netboot images fast enough to prevent watchdog reboots, or even how many DNS leases you can serve out before you are swamped just handling renewals, because you KNOW you set the lease period way shorter than the DNS designers ever expected[4].
[1] or Lady Florence pushing the Big Red Switch on Opening Day, not realising this one isn't a dummy
[2] or the dynamos, each racing to be the master frequency the rest have to sync to
[3] even in your home lab, can the circuit take that strain
[4] I *think* I understand what was going on here, allowing machine identities to move around as hardware becomes available to handle requests for user operations (please, if anyone can correct that understanding, do so) but is that how people normally do load balancing? Not my area at all, but this really feels like a misuse of DNS.
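For what it's worth, here's a minimal sketch of the sort of staggering plus rate limiting being described: per-node startup jitter in front of a global token bucket, so a fleet that all powers on together trickles back in rather than stampeding. Everything in it (the rate, the burst, the "register" step) is an invented illustration, not anything AWS actually runs.

```python
# Illustration only: jitter + a token bucket to tame a thundering herd of
# re-registrations. The numbers and the "register" step are made up.
import random
import threading
import time

class TokenBucket:
    def __init__(self, rate_per_sec, burst):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = float(burst), time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        # Block until a token is available (global admission rate limit).
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.05)

def register(node_id, bucket):
    time.sleep(random.uniform(0, 5))   # per-node startup jitter
    bucket.acquire()                   # then wait your turn at the shared limiter
    print(f"node {node_id} registered")

bucket = TokenBucket(rate_per_sec=10, burst=20)
threads = [threading.Thread(target=register, args=(i, bucket)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```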
really feels like a misuse of DNS.
I also got the impression low TTL DNS records were being used to route traffic which I would think at this scale really is playing with fire.
One of the advantages of IPv6 is that the address space is so large that each endpoint could be assigned a permanent address along with any number of ephemeral addresses, which could be assigned to processes, applications etc. and follow them as they migrate around the cluster, hopefully pushing the routing out to the network and its dynamic routing processes.
I suspect much of this stuff is a dark art simply because many of its practitioners have avoided enlightenment. ;)
Even worse are low-TTL DHCP leases. I can understand 3h for customer instances, but for the instances that /are/ AWS, I would have expected them to be longer. Surely a system that bills for uptime would be able to purge DHCP entries for terminated instances.
I would suggest you are looking at the problem from the wrong direction. The issue isn't existing DNS mappings. They work.
It's new mappings. You have to be able to create/delete records to flex services up/down/between data centres (US-EAST-1 is a collection of around 100 large data centres), and for each new instance that is required to cope with increased load or the migration of load between your capacity groupings (i.e. a data centre hall is likely the smallest grouping).
Once your DNS move/add/delete process is delayed, demand will create a situation where key services reach capacity and then you enter the downward spiral of no capacity to cope with current load and no ability to increase capacity.
This ignores any systems used to avoid this situation (DNS Planner and DNS Enactor) - my assumption is that something triggered the DNS issue, such as maintenance/power outages causing a loss of data centre capacity and some of the initial demand issues, because historically that has been the cause of a large number of previous US-EAST-1 outages.
It's worth noting that a number of AWS people have said that US-EAST-1 is too big to be stable BUT customers want it, and it provides valuable data for how to run other AWS regions reliably, as they have been built to avoid the extreme scale issues US-EAST-1 has. Ref: https://www.theregister.com/2024/04/10/aws_dave_brown_ec2_futures/
Depending on the situation, yes. Most often DNS is used for load balancing across multiple sites, though more commonly such load balancing is used for geo traffic distribution. Internal load balancing with DNS only is relatively rare, but some things do leverage it. Amazon has a history of layering DNS on top of DNS to mitigate(?) issues caused by wanting to rotate IP addresses on some things - most commonly an issue with ELB (ironically, you can find old articles where customers got flooded with traffic for other customers due to DNS cache issues after AWS changed their ELB IPs), and I think RDS. Fortunately I haven't had to deal with AWS myself in over a decade, so stress levels are much lower since I run my own stuff, which is super stable.
PHB decided that machines left on overnight is bad.
Everyone switches machine off, goes home.
Gets to work. Powers up. Machine needs to download and install an update.
3,000 machines and nearly 24 hours later, when no one has done a day's work, it's IT's fault.
That's the funniest thing I've read in a while, superb!
You have to hand it to the ignorant PHBs who will gladly shout the techies down and implement some diktat (or we get fired); we do as commanded and still get a bollocking 'cos the PHB is desperately trying to keep the C-suite from chewing his arse to pieces over whatever his latest cock-up was!
> The day after patch Tuesday with everyone turning on their laptops, along with a Europe-wide VPN and a single corporate-wide web proxy is just as bad, believe me.
Try a two-week Christmas shutdown, everyone powering back up on the first working day in January, and the AV systems deciding that it was too long for a delta and applying a full definitions update instead.
Burning WAN links and network guys screaming at us for using their precious bandwidth....
And yes we did put mitigations/throttling in place after the first time !
The beds actually do have the capability to work off a local control handset. But cloud is the cheapest and easiest way to enable voice control, by simply using Alexa or similar.
If you need one of these special beds then there's a chance that voice control will give you independence.
Some like to overlook that the bed doesn't actually depend on cloud in order to trigger the gullible. Maybe think it through first before filling another pair of Tena pants.
@Curmudgeon in Training
You didn't mention the subscription that goes with it...
What a world we live in when you have to pay a subscription fee to use your fucking bloodybollockybuggery bed working as intended.
https://help.eightsleep.com/en_us/can-i-cancel-autopilot-Bkbr7s9rn
> Autopilot, on a f*cking bed? What could go wrong
"Not tonight dear, I have a headache. Why don't you engage the autopilot."
The report explains what the race condition was, but not why the Enactor was running so slowly in the first place, which was technically the cause of the problem. I wonder whether they know - a challenge of work at this scale is that you can have problems that only happen with production workloads and it's hard to reproduce that and properly isolate the cause.
A cascade of FFS! causes:
1. The enactor should not have been running slow.
2. Nothing to monitor its status.
3. The various boxen had no idea what to do about it before jumping in feet-first, cuz nobody had thought through Murphy's Law.
4. Someone who knows what they are doing should have been retained with sweeteners, not driven out by bullying manglement.
I expect there are several more.
> And while the design may be good, ensuring implementation over years (and decades) of maintenance still achieves its robustness is even less easy.
Not to mention "minor" changes to the code which, as pointed out elsewhere, are impossible to test at scale. Until now, that is, and the test failed.
Now just imagine some weird scenario that takes out ALL of AWS or Azure and how long it would take to recover.
Business Continuity Plans anyone ? I think the only one for the cloud is Pray
That is very, VERY rarely the case. Yes, I started in microprocessor validation, but there are still ways to slow down a box. Take such a box, put it on an isolated network, and hook it up to a large number of not-slow boxes & let them go to town.
I get that most people would not have been exposed to such solutions. The FAANG engineers absolutely. Wait. I guess those are agents now....
As soon as you have more than one or two operations happening at the same time, being (re)triggered (by external sources), or just operating at scale, dealing with all of the permutations and combinations is rather complex - to say the least. Even supposedly simple systems can be mind-boggling sometimes.
At this point earlier articles about a brain drain in AWS make a lot of sense. Experienced developers/network admins would understand these complexities, and probably not roll out "quick changes" to anything without a thorough review (including in-house experience about parts of the system that are more prone to issues...etc). Juniors are more likely to feel the pressure to deliver (and kowtow to a PHB), and roll something out without understanding the implications. Worse, if someone was cutting and pasting simplistic code generated from AI...
I can just imagine the junior guy consulting with the AI frantically asking what do I do, what do I do? The AI helpfully responds with, "try turning the power off and on to reboot everything". I mean that is the uSoft script I think. And the AI scrapes the most probable answer!
@pfalcon
Came here to say that about asynchronous!
Every time you have two processes running with the same target, it is inevitable that they will at some stage process out of order; not just a chance, inevitable.
If you don't have a way to know that is happening, or a way to deal with what happens when it does, you should stick to synchronous locks.
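A minimal sketch of that "synchronous lock" option, with everything in it invented for illustration: two enactor-like workers race to apply plans to one shared target, and a lock plus a monotonically increasing plan version means a delayed worker's stale plan gets dropped instead of clobbering live state.

```python
# Sketch only: a lock plus a version check so a late, out-of-order worker
# can never overwrite (or delete) a newer plan. All names are invented.
import threading

class Target:
    def __init__(self):
        self.lock = threading.Lock()
        self.applied_version = -1
        self.records = {}

    def apply(self, version, records):
        with self.lock:
            if version <= self.applied_version:
                return False               # stale plan: reject, leave live state alone
            self.applied_version = version
            self.records = dict(records)
            return True

target = Target()

def enactor(name, version, records):
    ok = target.apply(version, records)
    print(f"{name}: plan v{version} {'applied' if ok else 'rejected as stale'}")

# Two workers, arriving in whatever order the scheduler fancies.
t1 = threading.Thread(target=enactor, args=("enactor-1", 2, {"api": "203.0.113.5"}))
t2 = threading.Thread(target=enactor, args=("enactor-2", 1, {"api": "203.0.113.4"}))
t1.start(); t2.start(); t1.join(); t2.join()
```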
Certainly not for AWS. Remember, they started out selling excess capacity after the Christmas rush. Their resilience has mostly been reactive & after-the-fact. I learned SRE at Google. Resilience there was critical from the get-go, so I expect that it should be easier to achieve HA there, but I've not actually worked that side yet to have any idea what it's like for GCP.
Some folks have DNS outages, automation messing things up.
All critical DNS entries on my systems have always been manually managed, and the majority of them can go for many years without needing any changes. IPs are statically assigned across the board, and lifetimes of systems are measured in years again. Slower change = less times things can go wrong. There is some automation for folks who wish to create new systems that DNS entries can be created automatically (I don't leverage this myself but others use it), but there has never been anything that modifies existing DNS entries (worst case you get a duplicate DNS entry if DNS/IPAM doesn't agree when building a new system but that only impacts that one new system).
Some people strive to automate everything; with cloud the automation needs are quite a bit higher, as there is far more complexity. I prefer simpler systems and generally have less automation. My mantra is more: if your automation saves a bunch of time in the long run, then fine, it could be good to do it. But if you spend a lot of time creating, maintaining and testing that automation, and that consumes as much time as doing it manually (or even close to as much), then don't bother.
I feel like we have clashed a bit over this in the past. What I believe you are failing to note is the scale of AWS. My semi-educated guess is that they have between 10 million and 100 million servers. At that scale, "doing it manually" really must be avoided like the plague, not just because of the cost involved, but because meat generally sucks at getting things perfect out of the gate. (Just check my posts for typos, for instance.)
It appears that you are running a tight ship, which is great. But it is a tiny, tiny ship compared to what AWS has.
Actually, if you have a problem on prem, in the eyes of the customers and, if it's big, the press, it's YourCompany that's the problem.
If the problem is AWS, then it's AWS that's the problem.
That's a huge reputational difference.
And please don't tell me that sufficiently complex on-prem systems don't have similar length outages. We just don't discuss them here.
In $DAY_JOB I frequently run up against smart arses who say "Why do you do it like that? I do this, it works and is super easy."
What those idiots forget is they are looking after a couple of VMs/Servers/Switches/etc. Come back to me with your "ideas" once you've got experience of running thousands of any of them. (And the company and all its employees depend on them all working)
Most businesses _really_ don't need five 9s though. Four, maybe. Three, definitely.
If your business can't cope with one of your IT systems having an unexpected lie down for 6 minutes over the course of a year, you're either doing something incredibly niche and safety-critical, or you're over-reliant on that system.
> Most businesses _really_ don't need five 9s though. Four, maybe. Three, definitely.
They need it whilst production is running
The IT systems can have a lie down at other time, we just need a way of scheduling the failures
I agree that the bulk of businesses don't need five nines. Most that do will also be on-prem for their business-critical requirements (manufacturing facilities, hospitals). I also KNOW that emergency dispatch in many locations were using Hangouts when I was an SRE. Mental health hotlines will be very similar. I expect that stock borkers (intended) also really need 5+ nines.
Anyone doing retail sales REALLY doesn't want to be down for even fifteen minutes on Cyber Monday--you can make a case for 5 nines if you are in retail.
But yes, if your business can be down for a full day with no serious consequences, you only need three nines, and your cloud spend will be expensive. But less expensive than a dedicated sysadmin if you are a small business/startup. My last company, with roughly 100 people, had an AWS + Heroku bill of $5000/mo when I started. We were a medical services middle-man, so we really did need better than four nines as well.
Thank goodness our proposed bright, shiny and new 'digital ID' system will be completely immune from any similar effect, eh? Thank the Lord that the system will NEVER go down and leave millions of citizens stranded without access to their money, ID, proof of driving licence and insurance, disappeared hospital appointments and probably no way to fuel their car as cash pumps will have been banned by The Ed Milli Band.
DNS is distributed - always was. BIND allows delegation to multiple autonomous redundant DNS servers which are distributed around the place so they won't all go at once. At one time (still?) the German registry insisted a domain's glue records had to be on separate subnets for redundancy. They thought it out. Only by losing ALL the root servers could you knock the system out, and there are loads - located on different continents. The load was balanced by virtue of multiple DNS servers each dealing with a small number of requests, managed by the domain administrators. Any failure would be isolated to the domains in question. The Unix peeps at Berkeley knew what they were doing.
Then some idiot thought "Let's set up a single point of failure and get lots of people to outsource their DNS to it, and let's manage it using a complicated database system and stuff it in our cloud so we'll get lots of money. Never give a sucker an even break."
So do you blame Amazon, or the management fools that rushed to the "cloud" because they fell for the marketing hype?
It may be efficient to design the cloud like a pyramid standing on its head, but a more confederated solution would be more robust.
Perhaps cloud giants like MSFT, GOOGL and AMZN should have a look at their cloud designs and divide their empires into independent regions. It is not good when a single bug or typo reverts time to the 1950s.
the "skill level" at Amazon is highly uneven. DynamoDB being one of the original services it used to be run by top-shelf talent. Similarly S3. But even 10+ years ago the brain drain was in full swing, it was reduced to monkeys from India (H1b natch) desperately trying to understand how the system worked and how to keep all the plates in the air and spinning. Said monkeys had never run a complex system, let alone seen one, and were entirely inadequate to the task. (This was when Jasse was head of AWS, several years before his promotion to CEO)
The S3 service, for example, relies on the notion of eventual consistency for customer data, so it has large leeway in how things are presented to the user. DynamoDB follows similar "eventual consistency", but on much shorter timescales. People FORGET or IGNORE these realities of the two services and treat them as ATOMIC with immediate READ-AFTER-WRITE consistency. From there all kinds of hilarity ensues.
It took 3 days for the S3 ecosystem (back then a mere 300,000 nodes) to coalesce and assemble a "system image" as distinct pods or spheres gradually merged and updated their maps of "who is my neighbor". DynamoDB works similarly. BUT their code base may be VERY outdated. It is common to find teams in AWS using a dozen different versions of the main tooling because they can't be arsed to stay current, or even moderately so. So the race condition between the two update/execute paths *may* have had, at its root, divergent code bases and thus behaviors.
IMO the constant fudging with DNS is an outgrowth of not understanding that DNS was never intended to be contorted into the way AWS uses it, and I suspect they could have used SLOW (relatively speaking) state-map updates to keep the map churn at a reasonable level. People who use AWS, and frankly people inside it too, don't seem to understand that INVALID DATA is FINE. Wait a bit and try again, or try the alternate answers. But no, they keep trying to achieve "perfection" in accuracy, and you get these bolted-on "improvements" which wreak havoc when some of the fundamental assumptions prove to be incorrect.
From the postmortem:
"The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing."
time-of-check, time-of-use bug.
You can fix this in one of two ways: Either check immediately before every change to make sure something hasn't updated it behind your back, or put a lock on it at the start so nothing can.
THE ACTIVE CONFIG COULD BE DELETED.
That should never happen.
The right way to do it would be to copy the active config, insert and delete on the copy, then once nobody is holding a lock on it you sanity-check it. If there is nothing obviously wrong (like it being completely empty!) you switch to it.
You also keep the previous config so you can quickly fail over if things crash.
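A minimal sketch of that copy/check/swap idea, assuming nothing about AWS's real fix; all the names and the "empty config" check are invented for illustration.

```python
# Sketch only: edit a copy, sanity-check it, swap atomically, and keep the
# previous config for rollback, so a broken "clean-up" can never empty the
# active config. Invented names; not the actual AWS remediation.
import copy
import threading

class ConfigStore:
    def __init__(self, initial):
        self._lock = threading.Lock()
        self.active = dict(initial)
        self.previous = dict(initial)

    def propose(self, mutate):
        with self._lock:
            candidate = copy.deepcopy(self.active)
            mutate(candidate)              # inserts/deletes happen on the copy only
            if not candidate:              # sanity check: never activate an empty config
                return False
            self.previous, self.active = self.active, candidate
            return True

    def rollback(self):
        # Quick fail-over to the last known-good config if things crash.
        with self._lock:
            self.active = dict(self.previous)

store = ConfigStore({"dynamodb.example": ["203.0.113.10", "203.0.113.11"]})
ok = store.propose(lambda cfg: cfg.clear())   # a buggy "clean-up" that deletes everything
print(ok, store.active)                       # False - the active config was never touched
```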
...and they told everyone. I daresay a permanent fix will follow. But.....
We're being asked to buy stuff that's centered around an always-on Internet connection that doesn't really need it; it's just a gimmick to collect data for sale to information brokers. I find it profoundly annoying that, for example, my house thermostats have to converse through a series of remote servers for me to be able to access them remotely (that is, from just across the ***!!** room). This is inherently bad, unstable design -- you'd never design an industrial plant like this, so why inflict it on people at home? (Greed.....of course.....)