How dare you bring so-called "facts".
Everything you know about last week's AWS outage is wrong
AWS put out a hefty analysis of its October 20 outage, and it's apparently written in a continuing stream of consciousness before the Red Bull wore off and the author passed out after 36 straight hours of writing. I'm serious here. It's to the point where if I included a paragraph half as long as some of these, El Reg's editor …
COMMENTS
-
-
Monday 27th October 2025 12:13 GMT Anonymous Coward
Re: Keeping it up is hard
> if companies were to decentralise, their downtimes would be greater.
Except, as mentioned in the article, part of the reason why Amazon has a better uptime record than their rivals is because they *are* more decentralized!
Decentralised systems are generally more resilient as a whole than centralised ones. Even if individual parts were to fail more often, those failures would be less impactful and the system itself would endure.
-
Tuesday 28th October 2025 11:51 GMT Tim 11
Re: Keeping it up is hard
This exactly.
As someone who splits their time almost equally between AWS and Azure, I find it impossible to overstate the difference in reliability between the two providers.
This also applies to accountability - it's difficult to believe Microsoft would even be able to diagnose what went wrong, let alone be prepared to explain it to their customers. The attitude of "if anything goes wrong, just try it again and it might work" applies throughout all the MS software and services I've used.
-
-
Monday 27th October 2025 12:52 GMT Anonymous Coward
Once upon a time, a valid question was "how many servers can a sysadmin handle." Today the answer is either "all of them," or else you're doing it wrong.
That certainly seems to be the prevailing belief amongst management in my place of work; there appear to be fewer and fewer sysadmins. Also fewer developers and QAs... still lots of managers, though.
-
Monday 27th October 2025 13:03 GMT Jamie Jones
"They employ some of the best engineers on the planet to think about these problems at a scale that few of us can really contextualize"
Many of the comments I've seen have said that they USED to employ the best engineers. It would be good to know if brain drain is an issue or not.
We've all seen many companies start out well, employing brilliant engineers to design the infrastructure. Then, invariably, those in charge (especially after the company has changed hands) wonder why they are paying people so much to run this faultless system; engineers are sacked, leading to a huge titsup when something inevitably goes wrong.
-
Monday 27th October 2025 13:20 GMT elsergiovolador
There is an order to this.
1. New manager comes in - identifies engineering as the place where cost "savings" could be made (they just read The Register and watch cat gifs, no evidence of work being done!)
2. Presents a compelling picture to the board, with all the right keywords to tickle the funny bones.
3. A percentage of the gained value is, of course, the bonus for said manager.
4. "Savings" are executed.
5. Manager pockets the bonus and leaves.
6. ???
-
Monday 27th October 2025 13:58 GMT Dan 55
I think this article better lays out the problem, including staff churn data.
-
Tuesday 28th October 2025 12:01 GMT Anonymous Coward
The best run systems aren't noticed
In my career, I came across numerous situations where staff were "released" because their departments/systems ran without problems - at least until they weren't there to keep them running without problems.
One example I was directly involved with (and may have written about previously here) was an offshore oil production platform whose oil export systems had never given any problems. The technicians (two of them, on fortnightly rotas) who maintained the system and kept calibrations up to date weren't stretched with work and often helped out on other instrumentation tasks on the platform. Manglement decided their work could easily be covered by other instrument techs, so their role was made redundant.
Less than a year later I arrived for my annual audit of their export metering system and, without too much time passing, I found a metering error outwith the contract tolerance for the pipeline they were using. I back-tracked their export to the last known good point, and my report would then result in the invalid export being deducted - several million USD off their company income.
The error arose from a problem elsewhere on the platform, one that had taken priority for the team who had inherited the export metering. Had the dedicated technician been on board, he wouldn't have been busy elsewhere and that several million USD would have remained in their credit. The role was quickly reinstated.
In my retirement, I often get involved with running event sound systems - any sound engineer will attest that their presence is only noticed when the sound isn't right. Do your job correctly, with everyone hearing as they expect, and you go unnoticed. Inadvertently press the wrong button on the mixing desk (especially if it results in a brief moment of audio feedback squeal) and you become the devil incarnate.
So yes, it's easy for new bean-counters to make savings by dispensing with people who have done their jobs well and, as a result of their professionalism, have gone almost unnoticed.
-
-
Monday 27th October 2025 21:14 GMT Bluck Mutter
change control
I consulted for decades with extremely large organizations (commercial and government), and the type of work I did (mission-critical database migrations and failover/DR implementations) by definition required downtime.
In almost every case, the organizations would blacklist months at a time where no changes could take place, no matter how trivial.
So a US online retailer might blacklist November to January to cover all the sales events over November/December and all the returns that happen in January.
A financial org might blacklist a month before and a month after their End of Year reporting cycle.
An org that made stuff would blacklist the period when their order processing was at its highest.
An airline might have multiple shorter duration blackout periods per year, aligned with peak travel volumes.
So, serious question: if these orgs have been stupid enough to move their mission-critical apps to the cloud, how do they stop some lowly paid guy in Mumbai from making a careless change during these previously blacklisted on-prem periods?
They can't!!!!
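For context, on-prem that freeze was usually just a gate at the start of every change or deploy script, owned by the org itself. A toy sketch of that kind of check - the window names and dates below are invented, not from any real contract:
```python
# Toy sketch of an on-prem change-freeze gate; window names and dates are invented.
from datetime import date

FREEZE_WINDOWS = [
    ("holiday-sales", date(2025, 11, 1), date(2026, 1, 31)),      # retail peak plus January returns
    ("year-end-reporting", date(2026, 3, 1), date(2026, 4, 30)),  # a month either side of the EOY close
]

def active_freeze(today=None):
    """Return the name of any freeze window covering today, else None."""
    today = today or date.today()
    for name, start, end in FREEZE_WINDOWS:
        if start <= today <= end:
            return name
    return None

if __name__ == "__main__":
    window = active_freeze()
    if window:
        raise SystemExit(f"Change blocked: freeze window '{window}' is in force")
    print("No freeze active - change may proceed (with the usual approvals)")
```
The point being: on-prem, the org owns that gate. Once the app is in someone else's cloud, nobody outside the provider does.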
Secondly, the issue now is that the Cloud is too big to fail and, despite the author castigating us keyboard jockeys (noting we actually might know what we are talking about... for example, I know how to design and implement failover and DR), the Cloud just can't keep growing exponentially unless regions are 100% autonomous, with nothing shared with other regions.
Yet time and time again we see contagion in one region spread to others, or we see some shared global resource cause multi-region issues when that resource shouldn't be global.
Bluck
-
-
Tuesday 28th October 2025 11:58 GMT Cliffwilliams44
Re: "Its always DNS"
Only someone who does not understand how AWS works would make a statement like this!
AWS may fail over hardware behind the scenes because of many factors you are not aware of, and the actual IP address of your load balancer may change; that's why you always use the assigned DNS name.
Your targets are referenced by service, e.g. instances, NOT IP addresses, for the very same reason!
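To make that concrete, here's a minimal boto3 sketch - the names and IDs (example-web-tg, example-alb, the VPC and instance IDs) are placeholders, not a full working setup: register targets by instance ID, and hand clients the DNS name AWS assigns rather than any IP.
```python
# Sketch only: target group of type "instance" plus lookup of the LB's DNS name.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Targets are EC2 instance IDs, not IPs, so AWS can shuffle hardware behind
# the scenes without breaking routing.
tg = elbv2.create_target_group(
    Name="example-web-tg",          # placeholder name
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",  # placeholder VPC ID
    TargetType="instance",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

elbv2.register_targets(
    TargetGroupArn=tg_arn,
    Targets=[{"Id": "i-0123456789abcdef0"}],  # placeholder instance ID
)

# Clients should always use the DNS name assigned to the load balancer;
# the IPs behind it are not stable and may change at any time.
lb = elbv2.describe_load_balancers(Names=["example-alb"])
print(lb["LoadBalancers"][0]["DNSName"])
```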
Speak not of what you do not know, because you do not know what you do not know.
-
Tuesday 28th October 2025 06:15 GMT sammystag
Strange title
"Everything you know about last week's AWS outage is wrong"
I was expecting to read some new revelation contradicting earlier explanations about what went wrong. However, the article lists one thing that would have been wrong if I thought it (that AI was the cause). It then confirms what I already thought from reading about it here, that it was DNS related.
"it's always DNS" is a half-step away from "this outage is caused by computers."
That's not my experience. I've dealt with my fair share of outages over the years. One or two have probably been DNS related, but the vast majority were not. Off the top of my head: database deadlocks, memory leaks, careless code causing full table scans, or a lack of resilience to another service being down are all likely culprits.
-
Tuesday 28th October 2025 09:34 GMT Taliesinawen
A DominoesDB cascade of failures across AWS :o
The ship went down because it lacked a local caching DNS server. The AWS outage on October 20, 2025, all began with a single race condition in DynamoDB’s automated DNS management system—one mishap that toppled services across the cloud like, well, DominoesDB.
1. “Single points of failure can hide in the most sophisticated architectures. AWS operates with extensive redundancy — multiple availability zones, distributed systems, automated failover. Yet a latent race condition in a single critical system (DNS management) was able to trigger widespread failures. The sophistication of the architecture actually contributed to the complexity of the failure—the more interconnected and automated systems become, the more subtle race conditions and edge cases can emerge.”
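To illustrate the kind of race being described - a toy sketch only, not AWS's actual DNS Enactor code; the hostname, IPs and plan numbers are made up - an updater that applies whatever plan it was handed, with no version guard, lets a delayed, stale plan overwrite a newer one, and a later cleanup pass then removes the "stale" record so the name resolves to nothing:
```python
# Toy sketch of a check-then-act race in automated DNS plan management.
dns_record = {}           # hostname -> list of endpoint IPs
applied_plan_version = 0  # version of the plan currently live

def enact(plan_version: int, endpoints: list[str]) -> None:
    """Apply a DNS plan without checking whether a newer one is already live - the latent race."""
    global applied_plan_version
    dns_record["dynamodb.example.internal"] = endpoints
    applied_plan_version = plan_version

def clean_up_stale_plans(current_version: int) -> None:
    """Housekeeping: if the live record belongs to an old plan, delete it."""
    if applied_plan_version < current_version:
        dns_record.pop("dynamodb.example.internal", None)

# Enactor A applies the newest plan...
enact(42, ["10.0.0.1", "10.0.0.2"])
# ...then a long-delayed Enactor B finally applies an older plan it was handed earlier.
enact(41, ["10.0.0.9"])
# Cleanup sees a record belonging to a stale plan and removes it.
clean_up_stale_plans(current_version=42)

print(dns_record)  # {} -> the hostname now resolves to nothing
```
The guard that prevents it is essentially one line: only enact a plan if its version is newer than whatever is already applied.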
-
Tuesday 28th October 2025 11:04 GMT glennsills@gmail.com
You are missing the point
It isn't a matter of making an application go "multi-cloud" - rather it is a matter of making lots of companies depend on lots of different clouds. A company would still be at risk, just as it is at risk with a multi-cloud application deployment, but the impact to the Internet writ large would be smaller.
-
Tuesday 28th October 2025 12:37 GMT An_Old_Dog
Big Red Switch Day
There are many mega-failure events which nobody tests for, due to the negative consequences of failure.
I once worked for a large research/teaching institution with many buildings on the main campus and 20~30 remote buildings scattered throughout the metro area. At that time, we had about 13,000 PCs and about 2,200 printers.
I never heard tell of us ever having had a Big Red Switch day, wherein all PCs and printers were powered off (not put into sleep mode) at COB, and the next morning, all researchers and support staff stood by their PCs and printers, and at 8:15 AM, everyone flipped on their respective Big Red Switches to see what would happen.
Would the electrical supply handle the surge current?
Would the network and DHCP servers handle the many near-simultaneous IP address requests?
Would the file servers handle the many near-simultaneous logins?
There, as in most other places, PC use grew like Topsy, and everyone just hopes, prays, or ignores what might happen if an unlucky electrical supply event provides an unscheduled Big Red Switch test.
-
Friday 31st October 2025 03:09 GMT Not Yb
Wasn't there the opposite of this article only a few days ago?
"You can multi-cloud" to avoid this was pretty high in that article, and yet here you are, changing your mind again because AWS' press release looked good?
Hmmmm.... Did somebody at AWS call your editor or something?
Wink once if yes.
-
Friday 31st October 2025 14:34 GMT graemep
With a lot of dedicated work, you can add another cloud provider or AWS region or datacenter into the mix until finally, at tremendous effort and expense, you have added a second single point of failure
A second single point of failure? That's a contradiction in terms, unless either one going down on its own would bring the system down, which would be odd.
Multi-cloud has proved valuable - for example, the Aussie pension fund that found its data was saved because the regulator forced them to go multi-cloud.