Everything you know about last week's AWS outage is wrong

AWS put out a hefty analysis of its October 20 outage, and it's apparently written as one continuous stream of consciousness before the Red Bull wore off and the author passed out after 36 straight hours of writing. I'm serious here. It's to the point where if I included a paragraph half as long as some of these, El Reg's editor …

  1. A Non e-mouse Silver badge

    How dare you bring so-called "facts".

    1. Aladdin Sane Silver badge

      This is the interwebs, we'll have no facts and well constructed arguments here.

      1. STOP_FORTH Silver badge

        Facts are good.

        I think he's letting them off a bit easily, though.

        1. sbegrupt

          Re: Facts are good.

          I think this is mostly due to the incident being a single-region failure.

          1. Pascal Monett Silver badge
            Trollface

            Re: Facts are good.

            Exactly. Only a "small number" of customers were impacted, right?

    2. Blackjack Silver badge

      These facts did not mention Amazon brain drain or toxic work culture, so change a few words and the same article could be used if the one causing the outage had been Google.

  2. Charlie Clark Silver badge

    Almost a direct contradiction…

    to the previous article by the same guy.

    1. MyffyW Silver badge

      Re: Almost a direct contradiction…

      Indeed - has the real Corey been abducted by Amazon and replaced with a Synthetic Stepford Scribe?

      1. Anonymous Coward
        Anonymous Coward

        Re: Almost a direct contradiction…

        "Replaced with a Synthetic Stepford Scribe?"

        Just Alexa with a genAI upgrade?

    2. steviesteveo

      Re: Almost a direct contradiction…

      I'm not saying Amazon have threatened his family, but this is what it would look like.

  3. Baird34

    Keeping it up is hard

    Credit where it's due, AWS performs outstandingly well. If companies were to decentralise, their downtimes would be greater.

    I have a hard enough time keeping my home lab up.

    1. Anonymous Coward
      Anonymous Coward

      Re: Keeping it up is hard

      > if companies were to decentralise, their downtimes would be greater.

      Except, as mentioned in the article, part of the reason why Amazon has a better uptime record than their rivals is because they *are* more decentralized!

      Decentralised systems are generally more resilient as a whole than centralised ones. Even if individual parts were to fail more often, those failures would be less impactful and the system itself would endure.
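      A back-of-the-envelope illustration of that argument (a hypothetical sketch with invented numbers, not anything from the article or the AWS write-up): splitting users across independent deployments doesn't reduce the average amount of downtime, but it makes an everyone-down-at-once outage astronomically less likely.

      ```python
      # Hypothetical numbers: one centralised deployment vs. ten fully
      # independent deployments, each down 0.1% of the time.
      p_down = 0.001
      n = 10

      # Centralised: when it's down, every user is down.
      p_all_users_down_central = p_down            # 1e-3

      # Decentralised: each deployment serves 1/n of the users independently.
      expected_fraction_down = p_down              # same average impact
      p_all_users_down_decentral = p_down ** n     # 1e-30

      print(f"centralised, everyone down:   {p_all_users_down_central:.0e}")
      print(f"decentralised, everyone down: {p_all_users_down_decentral:.0e}")
      ```

      (The catch, as the reply below notes, is that the pieces have to be genuinely independent.)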

      1. cdegroot

        Re: Keeping it up is hard

        Only if the pieces are independent.

        Which they never are.

        Not even in AWS, where global systems like IAM invariably worsen the scope of a regional outage.

    2. Charlie Clark Silver badge

      Re: Keeping it up is hard

      There are systems out there with uptimes of decades. In the meantime, IT has reinvented the wheel of contained systems several times…

      1. Gene Cash Silver badge

        Re: Keeping it up is hard

        Sure... my garage door Raspberry Pi has uptimes of months because that project is done and dusted, and no longer changing. That's the only time you get big uptimes.

        1. Charlie Clark Silver badge

          Re: Keeping it up is hard

          I think NTT used to figure big in such lists.

        2. vekkq

          Re: Keeping it up is hard

          wait until the sd card corrupts.

    3. Tim 11

      Re: Keeping it up is hard

      This exactly.

      As someone who splits their time almost equally between AWS and Azure, it's impossible to overstate the difference in reliability between the two providers.

      This also applies to accountability - it's difficult to believe Microsoft would even be able to diagnose what went wrong, let alone be prepared to explain it to their customers. The attitude of "if anything goes wrong just try it again and it might work" applies throughout all MS software and services I've used.

  4. vogon00

    'So yes, "it's always DNS" is a half-step away from "this outage is caused by computers." '

    DNS failure was a symptom, with a cascade of failures in the DNS-dependent upper layers as a result.

    The root cause (of the original DNS fault) was human-induced somehow.

    1. MyffyW Silver badge

      The cascade of failures is the aspect that should concern AWS. That tech will fail is almost a given, but an estate that can't cope with erroneous behaviour in one part is essentially a house of cards.

      1. Cliffwilliams44 Silver badge

        "tech" did not fail, technician failed!

        1. Tim 11

          it was a software fault, but not in DNS itself; it was in the synchronization of multiple processes trying to update DNS
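          To make that kind of synchronisation fault concrete, here is a minimal, hypothetical sketch of a lost-update race between two uncoordinated workers applying DNS "plans". It is not AWS's code, just the general shape of the failure class being described; a versioned compare-and-swap before each write would prevent it.

          ```python
          import threading
          import time

          # Hypothetical sketch: two workers apply DNS "plans" without checking
          # whether a newer plan has already been applied.
          dns_record = {"name": "service.example.internal", "ips": ["10.0.0.1"], "plan": 0}

          def apply_plan(plan_version, ips, delay):
              # Simulate a worker that was handed a plan, got delayed, then wrote
              # it anyway, with no check against the currently applied version.
              time.sleep(delay)
              dns_record["ips"] = ips
              dns_record["plan"] = plan_version

          fresh = threading.Thread(target=apply_plan, args=(2, ["10.0.0.2"], 0.1))
          stale = threading.Thread(target=apply_plan, args=(1, ["10.0.0.9"], 0.5))
          fresh.start(); stale.start()
          fresh.join(); stale.join()

          # The delayed worker's older plan (1) has overwritten the newer one (2).
          print(dns_record)  # {'name': ..., 'ips': ['10.0.0.9'], 'plan': 1}
          ```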

  5. elsergiovolador Silver badge

    Scapegoat

    It's always DNS

    Not people who wrote DNS, deployed it, maintained it. No. It's DNS.

    "A gun killed him, your honour, not me, I was only aiming and pulling the trigger, therefore I am innocent."

  6. Anonymous Coward
    Anonymous Coward

    Once upon a time, a valid question was "how many servers can a sysadmin handle." Today the answer is either "all of them," or else you're doing it wrong.

    That certainly seems to be the prevailing belief amongst management in my place of work; there appear to be fewer and fewer sysadmins. Also fewer developers and QAs... still lots of managers, though.

  7. Jamie Jones Silver badge

    "They employ some of the best engineers on the planet to think about these problems at a scale that few of us can really contextualize"

    Many of the comments I've seen have said that they USED to employ the best engineers. It would be good to know if brain drain is an issue or not.

    We've all seen many companies start out well, employing brilliant engineers to design the infrastructure. Then, invariably, those in charge (especially after a company has changed hands) wonder why they are paying so much for people to run this faultless system, so engineers are sacked, leading to a huge titsup when something inevitably goes wrong.

    1. elsergiovolador Silver badge

      There is an order to this.

      1. New manager comes in - identifies engineering as where cost "savings" could be made (they just read The Register and watch cat gifs, no evidence of work being done!)

      2. Presents a compelling picture to the board, with all the right keywords to tickle their funny bones.

      3. A % of the gained value is, of course, the bonus for said manager.

      4. "Savings" are executed.

      5. Manager pockets the bonus and leaves.

      6. ???

      1. spuck

        You forgot:

        0. Board determines they want to cut staff, but are all too cowardly to look anyone in the eye and do it themselves, so they vote to bring in a 'hatchet man'.

        I can fill in this one too:

        6. Board pockets their own bonuses, congratulate each other on another great quarter.

        1. elsergiovolador Silver badge

          "Gentlemen... oh and you Sarah, apologies... we are in this together!"

    2. Dan 55 Silver badge

      I think this article better lays out the problem, including staff churn data.

    3. Anonymous Coward
      Anonymous Coward

      The best run systems aren't noticed

      In my career, I came across numerous situations where staff were "released" because their departments/systems ran without problems - at least until they weren't there to keep them running without problems.

      One example I was directly involved with (and may have written about previously here) was an offshore oil production platform whose oil export systems had never given any problems. The technician (well, two technicians on fortnightly rotas) who maintained the system and kept calibrations up to date wasn't stretched with work and often helped out on other instrumentation tasks on the platform. Manglement decided their work could easily be covered by other instrument techs, so the role was made redundant.

      Less than a year later I arrived for my annual audit of their export metering system and, without too much time passing, found a metering error outwith the contract tolerance for the pipeline they were using. I back-tracked their export to the last known good point, and my report resulted in the invalid export being deducted: several million USD off their company income. The error arose from a problem elsewhere on the platform, one that had taken priority for the team who had inherited the export metering. Had the dedicated technician been on board, he wouldn't have been busy elsewhere, and that several million USD would have remained in their credit. The role was quickly reinstated.

      In my retirement, I often get involved with running event sound systems - any sound engineer will attest that their presence is only noticed when the sound isn't right. Do your job correctly, with everyone hearing as they expect, and you go unnoticed. Inadvertently press the wrong button on the mixing desk (especially if it results in a brief moment of audio feedback squeal) and you become the devil incarnate.

      So yes, it's easy for new bean-counters to make savings by dispensing with people who have done their jobs well and, as a result of their professionalism, have gone almost unnoticed.

      1. goblinski Bronze badge

        Re: The best run systems aren't noticed

        So the sound engineer should have a periodic chime during unimportant moments that goes "...This is what the sound sounds like without me..."

  8. Dr Paul Taylor

    Bad IT journalism

    try this one about the AWS outage:

    on the Guardian

    1. Dr Paul Taylor

      Maybe my comment was ambiguous. I trust El Reg for the facts about IT. The Guardian sometimes writes ok stuff about politics and other things.

      1. Anonymous Coward
        Anonymous Coward

        Oh, you mean the Guardian article is bad, not this one?

    2. Diogenes8080

      Re: Bad IT journalism

      Did they claim it was all caused by a DSN fault?

      A knowledge of right-pondian journalism is required to recognise that joke.

  9. Kevin McMurtrie Silver badge

    "they have never had a global outage"

    Some types of instances are effectively unavailable outside of us-east-1. It's a global outage unless you're wealthy enough to purchase dedicated hardware elsewhere.
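    For anyone who wants to check that for a particular instance type, a hedged boto3 sketch (the instance type and region list below are placeholders) can ask which regions even offer it; being offered is not the same as capacity actually being obtainable, which is the point above.

    ```python
    import boto3

    # Hypothetical sketch: see which regions offer a given instance type.
    # An offering only means the type exists there, not that capacity is
    # actually available to launch.
    instance_type = "p4d.24xlarge"                       # placeholder
    regions = ["us-east-1", "us-west-2", "eu-west-1"]    # placeholder list

    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        resp = ec2.describe_instance_type_offerings(
            LocationType="region",
            Filters=[{"Name": "instance-type", "Values": [instance_type]}],
        )
        offered = bool(resp["InstanceTypeOfferings"])
        print(f"{region}: {'offered' if offered else 'not offered'}")
    ```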

  10. Andy Tunnah

    Finally, Journalistic Integrity

    > I too desire attention, but you've gotta earn it

    See this is why we come to el reg - the last bastion of honesty and integrity from the fourth estate.

    That and the self loathing.

  11. C R Mudgeon Silver badge
    Pint

    "'theory' being the name of my staging environment"

    That there? It deserves a pitcher, all on its own.

  12. Anonymous Coward
    Anonymous Coward

    "A single AWS region is a single point of failure."

    Didn't XKCD teach us that it was all coming from the same place anyway?

    1. EnviableOne

      Re: "A single AWS region is a single point of failure."

      one of my go-tos:

      https://xkcd.com/908/

  13. Bluck Mutter

    change control

    I consulted for decades to extremely large organizations (commercial and government), and the type of work I did (mission-critical database migrations and failover/DR implementations) by definition required downtime.

    In almost every case, the organizations would blacklist months at a time where no changes could take place, no matter how trivial.

    So a US online retailer might blacklist November to January to cover all the sales events over November/December and all the returns that happen in January.

    A financial org might blacklist a month before and a month after their End of Year reporting cycle.

    An org that made stuff would blacklist the period where their order processing was at its highest.

    An airline might have multiple shorter duration blackout periods per year, aligned with peak travel volumes.
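    As an illustration of how such freezes tend to be enforced on-prem (a hypothetical sketch; the windows below are invented to match the examples above), a change request can be rejected automatically if it lands inside a blackout window. That automatic "no" is exactly the lever you don't have over a cloud provider's own control-plane changes.

    ```python
    from datetime import date

    # Hypothetical change-freeze check; the windows are invented examples
    # matching the scenarios described above.
    BLACKOUT_WINDOWS = [
        (date(2025, 11, 1), date(2026, 1, 31)),  # retailer: peak sales and returns
        (date(2026, 3, 1), date(2026, 4, 30)),   # financial org: year-end reporting
    ]

    def change_allowed(proposed: date) -> bool:
        """Return False if the proposed change falls inside any blackout window."""
        return not any(start <= proposed <= end for start, end in BLACKOUT_WINDOWS)

    print(change_allowed(date(2025, 12, 15)))  # False: inside the retail freeze
    print(change_allowed(date(2026, 6, 10)))   # True
    ```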

    So, serious question. If these orgs have been stupid enough to move their mission-critical apps to the cloud, how do they stop some lowly paid guy in Mumbai making a careless change during what would have been blacklisted periods on-prem?

    They can't!!!!

    Secondly, the issue now is that the Cloud is too big to fail, and despite the author castigating us keyboard jockeys (noting we actually might know what we are talking about... for example, I know how to design and implement failover and DR), the Cloud just can't keep growing exponentially unless regions are 100% autonomous, with nothing shared with other regions.

    Yet time and time again we see contagion in one region spread to others, or we see some shared global resource cause multi-region issues when that resource shouldn't be global.

    Bluck

  14. DS999 Silver badge

    "Its always DNS"

    It was only DNS this time because Amazon does their internal load balancing using DNS. That made sense at one time, but I'm not sure it still does in 2025.
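    For readers who haven't met the pattern: DNS-based load balancing simply means the service name resolves to a pool of addresses and each client picks one. A minimal sketch follows (the hostname is a placeholder, not a real AWS endpoint); taking a backend out of rotation is just publishing a new record set, which is also why a botched DNS update can make the whole service unreachable at a stroke.

    ```python
    import random
    import socket

    # Minimal sketch of DNS-based load balancing: resolve the service name,
    # collect every address it currently maps to, and pick one per request.
    def pick_backend(hostname: str, port: int = 443) -> str:
        infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
        addresses = sorted({info[4][0] for info in infos})
        return random.choice(addresses)

    print(pick_backend("example.com"))  # placeholder hostname
    ```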

    1. seven of five Silver badge

      Re: "Its always DNS"

      Probably something someone kludged together in 2007 that has been festering ever since, to the point where, a decade later, no one dares touch it.

    2. Cliffwilliams44 Silver badge

      Re: "Its always DNS"

      Only someone who does not understand how AWS works would make a statement like this!

      AWS may fail over hardware behind the scenes because of many factors you are not aware of, and the actual IP address of your load balancer may change; that's why you always use the assigned DNS name.

      Your targets are referenced by service, e.g. instances, NOT IP addresses, for the very same reason!

      Speak not of what you do not know, because you do not know what you do not know.
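      A hedged boto3 sketch of the two points above (the target group ARN, instance IDs and DNS name are placeholders, and it assumes a target group created with target type "instance"):

      ```python
      import boto3

      # Hypothetical sketch: targets are registered by instance ID, not IP,
      # and clients connect via the load balancer's assigned DNS name.
      elbv2 = boto3.client("elbv2", region_name="us-east-1")

      elbv2.register_targets(
          TargetGroupArn=(
              "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
              "targetgroup/example-tg/0123456789abcdef"
          ),
          Targets=[{"Id": "i-0abc123def456789a"}, {"Id": "i-0fedcba9876543210"}],
      )

      # Always hand clients the DNS name; the IPs behind it can change whenever
      # AWS fails over or rebalances the load balancer nodes behind the scenes.
      LB_DNS_NAME = "example-lb-1234567890.us-east-1.elb.amazonaws.com"
      ```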

      1. Freddie

        Re: "Its always DNS"

        Then how does one know that they do not know it? Tell me of what I do not know! Or do you not know it?

  15. sammystag

    Strange title

    "Everything you know about last week's AWS outage is wrong"

    I was expecting to read some new revelation contradicting earlier explanations about what went wrong. However, the article lists one thing that would have been wrong if I thought it (that AI was the cause). It then confirms what I already thought from reading about it here, that it was DNS related.

    "it's always DNS" is a half-step away from "this outage is caused by computers."

    That's not my experience. I've dealt with my fair share of outages over the years. One or two have probably been DNS-related, but the vast majority were not. Off the top of my head: database deadlocks, memory leaks, careless code causing full table scans, or a lack of resilience to another service being down are all likely culprits.

    1. Anonymous Coward
      Anonymous Coward

      Re: Strange title

      ...surely you mean: strange game. Since, as we all know, the only winning move is not to play.

      Come to think of it, that is also true of hosting critical services exclusively on someone else's computer.

  16. Pascal Monett Silver badge
    Thumb Up

    Love it

    Microsoft is convinced that "uptime" should be two words, probably in someone else's account.

    That is satire at its best.

  17. Taliesinawen

    A DominoesDB cascade of failures across AWS :o

    The ship went down because it lacked a local caching DNS server. The AWS outage on October 20, 2025, all began with a single race condition in DynamoDB’s automated DNS management system—one mishap that toppled services across the cloud like, well, DominoesDB.

    1. “Single points of failure can hide in the most sophisticated architectures. AWS operates with extensive redundancy — multiple availability zones, distributed systems, automated failover. Yet a latent race condition in a single critical system (DNS management) was able to trigger widespread failures. The sophistication of the architecture actually contributed to the complexity of the failure—the more interconnected and automated systems become, the more subtle race conditions and edge cases can emerge.”

  18. glennsills@gmail.com

    You are missing the point

    It isn't a matter of making an application go "multi-cloud" - rather it is a matter of making lots of companies depend on lots of different clouds. A company would still be at risk, just as it is at risk with a multi-cloud application deployment, but the impact to the Internet writ large would be smaller.

  19. Apocalypso - a cheery end to the world Bronze badge
    Joke

    Breakfast

    > "When thing X happens, do thing Y" is how computers basically work, and it's about as much "AI" as it is a breakfast cereal.

    I eat grains for breakfast. Often resulting in movements. A bit like real AI.

  20. An_Old_Dog Silver badge

    Big Red Switch Day

    There are many mega-failure events which nobody tests for, due to the negative consequences of failure.

    I once worked for a large research/teaching institution with many buildings on the main campus and 20~30 remote buildings scattered throughout the metro area. At that time, we had about 13,000 PCs and about 2,200 printers.

    I never heard tell of us ever having had a Big Red Switch day, wherein all PCs and printers were powered off (not put into sleep mode) at COB, and the next morning, all researchers and support staff stood by their PCs and printers, and at 8:15 AM, everyone flipped on their respective Big Red Switches to see what would happen.

    Would the electrical supply handle the surge current?

    Would the network and DHCP servers handle the many near-simultaneous IP address requests?

    Would the file servers handle the many near-simultaneous logins?

    There, as in most other places, PC use grew like Topsy, and everyone just hopes/prays/doesn't think about what might happen if an unlucky electrical supply event provides an unscheduled Big Red Switch test.
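    The usual mitigation for the surge those questions describe is to stagger the start-up requests rather than fire them all at 8:15 sharp. A minimal, hypothetical sketch of the idea (the window and client ID are invented):

    ```python
    import random
    import time

    # Hypothetical sketch: spread boot-time requests (DHCP leases, file
    # server logins) across a window instead of issuing them simultaneously.
    STARTUP_WINDOW_SECONDS = 300  # invented five-minute window

    def phone_home(client_id: str) -> None:
        time.sleep(random.uniform(0, STARTUP_WINDOW_SECONDS))  # random jitter
        print(f"{client_id}: requesting DHCP lease and logging in")

    # In practice each PC would run this independently at power-on.
    phone_home("pc-0001")
    ```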

  21. hx

    Meh, I'm better at uptime at a much lower price point, but I'm not in the business of selling solutions to problems created by our other solutions. Probably why I haven't retired from the main job to focus on my passion of building rockets or smart toilets.

  22. Not Yb Silver badge

    Wasn't there the opposite of this article only a few days ago?

    "You can multi-cloud" to avoid this was pretty high in that article, and yet here you are, changing your mind again because AWS' press release looked good?

    Hmmmm.... Did somebody at AWS call your editor or something?

    Wink once if yes.

  23. graemep Bronze badge
    WTF?

    With a lot of dedicated work, you can add another cloud provider or AWS region or datacenter into the mix until finally, at tremendous effort and expense, you have added a second single point of failure

    A second single point of failure? Contradiction in terms unless either one going down would bring the system down, which would be odd.

    Multi-cloud has proved valuable - for example, the Aussie pension fund that found its data was saved because the regulator forced them to go multi-cloud.
