I assume this means they've set the default TTL on DNS queries to 60 minutes. I've done that sort of thing (except down to 10 minutes) for scheduled changes to allow IP address changes to propagate faster when a change is made. 60 minutes is probably a reasonable compromise for unscheduled outages.
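The effect the commenter describes follows directly from how resolvers honour TTLs. A minimal sketch of a TTL-respecting cache (names are illustrative, not any real resolver's API) shows why a lowered TTL means a changed IP is picked up sooner:

```python
import time

# Minimal sketch of a TTL-respecting DNS cache. Once a cached answer's
# TTL expires, the resolver must re-query and will see any new address,
# which is why dropping the TTL ahead of a change speeds up propagation.

class DnsCache:
    def __init__(self):
        self._entries = {}  # name -> (address, expires_at)

    def put(self, name, address, ttl_seconds):
        self._entries[name] = (address, time.monotonic() + ttl_seconds)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._entries[name]  # stale: force a fresh lookup upstream
            return None
        return address

cache = DnsCache()
cache.put("app.example.com", "192.0.2.10", ttl_seconds=600)  # 10-minute TTL
```

With a 60-minute TTL an old address can be served for up to an hour after a change; with a 10-minute TTL, at most ten minutes.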
AWS builds a DNS backstop to allow changes when its notoriously flaky US East region wobbles
The cause of major internet outages is often the domain name system (DNS) and/or problems at Amazon Web Services’ US East region. The cloud giant has now made a change that will make its own role in such outages less painful. As explained in a Wednesday post, AWS customers told the cloud colossus “they need additional DNS …
COMMENTS
-
-
Thursday 27th November 2025 17:15 GMT theblackhand
It's not the DNS lifetime that caused the issue - it's the inability to create new DNS entries that stalls/stops provisioning, resulting in a feedback loop that further overloads provisioning as failures generate even more provisioning requests.
I believe it means that if AWS experiences a major issue with its provisioning service, within 60 minutes (an SLA rather than how quickly they can potentially enable this feature) they will allow an alternative method of provisioning resources.
My guess is that AWS have preallocated a limited emergency range of DNS/IP mappings that can be safely allocated in an emergency and don't require DB access. That relieves pressure on the DynamoDB solution to allow key organisations to recover faster and probably allow AWS to recover faster as well.
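That guess can be sketched as a provisioning path with a static, preallocated fallback pool that needs no database access. Everything here is hypothetical - the names, pool, and addresses are invented to illustrate the commenter's speculation, not AWS's actual design:

```python
from itertools import count

# Hypothetical sketch: a small, preallocated pool of emergency DNS/IP
# mappings that can be handed out without touching the database-backed
# control plane, relieving pressure on it during an outage.

EMERGENCY_POOL = [
    ("emergency-0.internal.example", "198.51.100.10"),
    ("emergency-1.internal.example", "198.51.100.11"),
    ("emergency-2.internal.example", "198.51.100.12"),
]

_next_emergency = count()

def allocate_from_database():
    # Placeholder for the normal, database-backed provisioning path.
    return ("app-1234.internal.example", "203.0.113.25")

def provision_dns(database_available: bool):
    """Use the DB when healthy; fall back to the static emergency pool."""
    if database_available:
        return allocate_from_database()
    try:
        return EMERGENCY_POOL[next(_next_emergency)]
    except IndexError:
        raise RuntimeError("emergency pool exhausted")
```

The point of the fixed pool is that it is finite and precomputed: handing out an entry requires no coordination with the overloaded control plane.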
-
-
Thursday 27th November 2025 07:57 GMT Anonymous Coward
At last!
There have been regular, albeit infrequent, global problems due to us-east-1 dependencies. At least if they are now admitting it openly, it may mean a program to fix it and give regions true independence. If they do, I think it will provide an advantage, with critical-app clients being more inclined to use AWS.
-
-
Thursday 27th November 2025 13:57 GMT Anonymous Coward
Re: At last!
Methinks you are much too optimistic. IIRC, they wrote the entire thing themselves. This means that the more fundamental the problem, the more of their stack they're going to have to redesign and then rewrite.
Not sure how doing something in-house can be seen as a negative in this respect.
My understanding was that the majority of their stuff was proprietary and in-house, and I'd have thought it weird and uncharacteristic for them to have got a 3rd party tool in to do this for them. A 3rd party they don't own or have design control over.
I don't see constantly outsourcing as a positive, more a negative, especially in crucial systems. I doubt a 3rd party could have written and implemented a fix in six weeks.
-
Thursday 27th November 2025 14:11 GMT Roland6
Re: At last!
>” my understanding was that the majority of their stuff was proprietary and in-house & i'd have thought it weird and uncharacteristic for them to have got a 3rd party tool in to do this for them.”
My understanding is the tools are all based on Open Source with in-house/proprietary enhancements.
So in some respects their infrastructure is based on “ A 3rd party they don't own or have design control over.”
>” i doubt a 3rd party could have written & implemented a fix in 6 weeks.”
I doubt AWS have written a full solution; they have most probably modified the implementation to make better use of existing capabilities. Expect that, in the background, a group is working on a more permanent solution. However, as with everything to do with cloud, don’t expect anything to be fed back to the original Open Source projects.
-
Friday 28th November 2025 07:51 GMT StewartWhite
Re: At last!
AWS haven't fixed the problem. They've just introduced a bodge that might make things somewhat better next time round although as with any bodge there's a reasonably high likelihood that it will make things worse when the next issue occurs.
Ultimately AWS can't be bothered to properly resolve the issue as they've accumulated too much technical debt to make anything other than a complete redesign work but that would cost too much time and money according to the beancounters. After all, $10+ billion in profits doesn't go very far you know.
-
-
-
-
Thursday 27th November 2025 16:02 GMT anothercynic
Re: At last!
US-East-1 is the default selection in AWS, so it's no wonder that it gets hammered more than anything else.
Maybe AWS should check the rough region the user is logging in from and suggest alternatives... It could default European users to eu-west-1 (Ireland) or similar. Nothing stops you from selecting another region, but the default should be a region closer to you than us-east-1.
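The suggestion amounts to a tiny lookup from rough client geolocation to a nearby default region. A sketch, assuming some upstream geo service supplies a continent code (the mapping below is illustrative; only the region names are real AWS identifiers):

```python
# Illustrative mapping from a rough continent code to a suggested
# default region, falling back to us-east-1 (today's hard-coded default).
# How the continent code is obtained (GeoIP, login locale) is assumed.

SUGGESTED_REGION = {
    "EU": "eu-west-1",        # Ireland
    "NA": "us-east-1",        # N. Virginia
    "SA": "sa-east-1",        # Sao Paulo
    "AS": "ap-southeast-1",   # Singapore
    "OC": "ap-southeast-2",   # Sydney
}

def default_region(continent_code: str) -> str:
    """Suggest a nearby region; unknown locations keep the old default."""
    return SUGGESTED_REGION.get(continent_code, "us-east-1")
```

The user can still override the suggestion; only the starting value changes.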
Just my opinion...
-
Thursday 27th November 2025 17:08 GMT Excused Boots
Re: At last!
"There's been regular, albeit infrequent, global problems due to us-east-1 dependencies. At least if they are now admitting it openly it may mean a program to fix it and give regions true independence.”
This does sound like a step in the right direction. But I do worry that as AWS (and Azure and Google Cloud) become more and more complex, and the people who originally designed and built it leave or are let go (because AI), they may not actually know what all the dependencies on us-east-1 are until it falls over.
A little like everyone is fine until the eponymous maintainer in Nebraska decides that he or she can’t be bothered anymore!*
* No I’m not providing a reference, you all know what it means.
-
Friday 28th November 2025 12:21 GMT TeeCee
Re: At last!
...give regions true independence.
Surely that's the exact opposite of the correct approach? If this really were cloud computing then a major outage anywhere should have no impact on the users, as other locations just pick up the workloads and continue transparently.
We still seem to be a long way away from getting the originally promised benefits, of which availability and fault-tolerance due to having no single point of failure were front and centre waving the flags.
-
-
This post has been deleted by its author
-
Thursday 27th November 2025 18:28 GMT Claptrap314
El Reg should be more sceptical
Yes, us-east-1 is huge. But that's not why so much of the trouble is there. AWS runs all of its "global" systems out of us-east-1, so a failure in that region affects a lot more than necessary. In particular, since IAM is global, if us-east-1 has issues impacting it, you cannot spin up new servers anywhere. Us-east-1 is also where experiments happen. (Or, at least, it was; have they fixed that?)
Again, Amazon has never had the right people in the room to build a resilient system.
-
Friday 28th November 2025 13:24 GMT disk iops
damnable fancy-pants
The problem here is two-fold.
1 - too damn big of a single zone. "We" (internal staff) were harping about this back in 2012, warning that it would be the ruin of us.
2 - too damn cute by half.
The footprint of these services is massive, and DNS is LOUSY at 'real time' data updates, in particular because every client's software is STUPID about caching answers and won't move off and try the next one. But they chose to abuse the shit out of DNS so they could steer traffic and "rapidly" enter and exit resources that were coming in or going out/dead. IT IS NOT A SIN IF THE ENDPOINT IS NOT THERE! Since we KNOW DNS clients are total shite, you're supposed to do all your *magic* where people (clients) can't see it. Furthermore they can and should have rolled out a custom resolver across the fleet that was smarter than the typical dumb-ass implementation and did things like using record priority tags.
They tried to force DNS into "instant" updates and it bit them rightfully in the ASS. It obviously doesn't help when your coders are too friggin stupid to write software that is DEFENSIVE in what it accepts. Partially wrong (old) answers are preferable to no answer. The internal rate-limiting is also mostly a manual intervention. There is/was very little AGGRESSIVE rate-limit logic built in. It is all written with the assumption "it just works".
Akamai did this correctly - the IPs are fixed as far as the 'front door' is concerned. All the machinations and adding/dead discovery happens on the BACKSIDE with KNOWN FIXED last-ditch endpoints in their CDN pods. So yes, user traffic may overrun the capacity of these fixed points, but that's relatively minor compared to service outage. Furthermore, internal cross-talk should have utter priority if not its own sub-division of resources that are impervious to end-user load.