Amazon brain drain finally sent AWS down the spout

"It's always DNS" is a long-standing sysadmin saw, and with good reason: a disproportionate number of outages are at their heart DNS issues. And so today, as AWS is still repairing its downed cloud as this article goes to press, it becomes clear that the culprit is once again DNS. But if you or I know this, AWS certainly does …

  1. Anonymous Coward

    Jesus, this makes very stressful reading for senior AWS executives. So stressful in fact, that a significant remuneration bump is needed to keep them onboard. And a performance bonus if it doesn't happen again in the next 3 months

    1. Anonymous Coward

      Speaking of Executives

      We had a situation where a half-dozen highly experienced grey-beards (40+ average age and very good engineers) were commanded by a newly appointed CIO (who was so young he couldn't grow a beard, but had an ornately framed MIS / CIS degree from an Ivy League school) to activate a project. They disagreed strongly, writing comprehensive emails detailing numerous failings; he overrode them, it went live, and it went totally pear-shaped within hours.

      The CIO then blamed his people for reckless behaviour and half were fired. The others left of their own volition within three months. The CIO got his just deserts. He became CEO of another company that hit bankruptcy within two years. They were bought out and he walked away with 15 million in his back pocket. So he learned his lesson.

      When people talk of AI taking jobs, I keep thinking replacing senior management with an LLM would make better sense.

      1. O'Reg Inalsin Silver badge

        The more things change .... (Gilbert and Sullivan's 1879 comic opera The Pirates of Penzance)

        I am the very model of a modern Major-Gineral

        I've information vegetable, animal, and mineral,

        I know the kings of England, and I quote the fights historical

        From Marathon to Waterloo, in order categorical;

        I'm very well acquainted, too, with matters mathematical,

        I understand equations, both the simple and quadratical,

        About binomial theorem I'm teeming with a lot o' news,

        With many cheerful facts about the square of the hypotenuse.

        ...

        For my military knowledge, though I'm plucky and adventury,

        Has only been brought down to the beginning of the century;

        But still, in matters vegetable, animal, and mineral,

        I am the very model of a modern Major-Gineral.

        1. agurney

          Re: The more things change .... (Gilbert and Sullivan's 1879 comic opera The Pirates of Penzance)

          sorry, wrong opus, Pinafore is more apposite:

          .. Of legal knowledge I acquired such a grip

          That they took me into the partnership.

          And that junior partnership, I ween,

          Was the only ship that I ever had seen.

          Chorus.

          Was the only ship that he ever had seen.

          But that kind of ship so suited me,

          That now I am the ruler of the Queen's Navee!

        2. Anonymous Coward

          Re: The more things change .... (Gilbert and Sullivan's 1879 comic opera The Pirates of Penzance)

          Ross Scott's version (https://youtu.be/2HdlRqvIyFs?t=113) contains amusing additional verses (and video game gunfire/explosions)...

      2. Anonymous Coward

        Re: Speaking of Executives

        That story makes me so sad. I know so many people like that... they have nothing but a sense of entitlement, and somehow do well.

      3. Anonymous Coward

        Re: Speaking of Executives

        One of our twit senior execs had a PhD in "leadership" lol. Presumably there was coverage of how much to tip the chauffeur on Christmas.

    2. andrewj

      Can't they just vibe a new cloud into existence? I thought everything was easy now.

      1. John Brown (no body) Silver badge
        Coat

        You'd think so. The big "selling point" of cloud is availability. Most don't actually get that, because it's an expensive extra, but you'd think AWS themselves would have proper fail-over for everything. :-)

        1. disk iops

          1 NIC. Port

          Every EC2 instance has 1 NIC port. ONE. This includes all instances used internally for backplane. There is also only 1 top-of-rack switch. Every S3 JBOD (45+ drives) has ONE power supply. Even better, they had to replace 75000 power supplies because they shaved the wattage so close that all it took was 2 drives going into click-of-death mode to blow the fuse and wipe out the entire node.

          1. Claptrap314 Silver badge

            Re: 1 NIC. Port

            This is mostly actually more than just defensible. It is VERY clearly the right thing, with one, simple, caveat: the customers have to understand what this is.

            You don't build resilience by using perfect parts. The stack is ENTIRELY too high for that game to ever work. You build resilience by putting enough things in parallel that even when outages do a "Yo Dawg", you are in the clear.

            Individual EC2 instances have NEVER been sold as resilient. If you desire resilience, you have to create applications which aren't bothered when EC2 instances go away. You also need an ops team that knows how to handle the necessary traffic routing & alarms.
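
            A back-of-the-envelope sketch of that parallelism argument, assuming (purely for illustration) a 99% per-instance availability and fully independent failures, which a shared dependency such as a regional DNS problem would violate. The function name and the figures are hypothetical, not AWS numbers:

```python
from math import comb


def fleet_availability(instance_availability: float, instances: int, needed: int = 1) -> float:
    """Probability that at least `needed` of `instances` are up.

    Assumes every instance fails independently with the same availability,
    an illustrative simplification rather than a measured AWS figure.
    """
    p = instance_availability
    # Binomial tail: sum the probability of exactly k instances being up,
    # for every k from `needed` to `instances`.
    return sum(
        comb(instances, k) * p**k * (1 - p) ** (instances - k)
        for k in range(needed, instances + 1)
    )


if __name__ == "__main__":
    # Hypothetical 99% per-instance availability, one instance needed.
    for n in (1, 2, 3):
        print(n, fleet_availability(0.99, n))
```

            Under those assumptions, running two instances where one suffices moves you from roughly 99% to roughly 99.99%; the independence assumption is exactly what shared dependencies break.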

            1. that one in the corner Silver badge

              Re: 1 NIC. Port

              > If you desire resilience, you have to create applications which aren't bothered when EC2 instances go away. You also need an ops team that knows how to handle the necessary traffic routing & alarms.

              So you have to spend inordinate amounts on devs to do the extra clever coding and on a big and fast responding ops team to keep up with all the routine hardware failures? Certainly hope you're saving enough in the low, low cost of hiring the EC2 instances to make up for that.

              And all of that required and repeated in every customer.

              Wasn't one of the advantages of modern computing and communications supposed to be that we can build resilience through redundancy right into every layer of the stack, from paired PSUs and RAID through to hypervisors that automatically restart your processes on backup CPUs? Then use massive scaling of identical such units, and timesharing to ensure maximum utilisation, to bring down costs? With the multiple identical hardware boxes physically concentrated, allowing support and maintenance staff to be superbly expert in that specific kit rather than merely ok in handling a random assortment of parts? Thereby allowing all of the differences in the workloads to be concentrated into just the topmost layer, the customer applications, so that each can concentrate on what is unique and revenue generating, relying upon a robust infrastructure to supply the common parts?

              Is everyone absolutely certain that the current arrangement (which is presumably this "Cloud" we keep hearing about) is actually sensible?

              1. werdsmith Silver badge

                Re: 1 NIC. Port

                One NIC per EC2 is actually more than I thought there would be. I pictured a host with loads of EC2 VMs sharing a couple of NICs.

                1. Claptrap314 Silver badge

                  Re: 1 NIC. Port

                  The hypervisor is slicing that NIC like everything else on the system. Yeah, 1 does sound a bit thin, but who knows. They're doing the math, I know that.

              2. Williaco

                Re: 1 NIC. Port

                "Is everyone absolutely certain that the current arrangement (which is presumably this "Cloud" we keep hearing about) actually sensible?"

                It definitely is, according to Orson Krennic.

              3. Anonymous Coward

                Re: 1 NIC. Port

                My background is blue chip hardware vendor, now at hyperscaler so I have experience of both schools of thought.

                For me what you are describing comes down to: whose job is it to build resilience? There are also completely different challenges of running at different scales, which means the traditional on-premises mindset just doesn't work in hyperscale, and vice versa.

                What you are describing as a stack sounds wonderful, but candidly is a result of infrastructure having to overcome the challenges caused by lazy developers not actually worrying about keeping their platform up. It's a vicious circle whereby infrastructure teams have to build a solution to an application resilience problem, so the application team don't think they need to build resilience, which means the infrastructure team need to build resilience .......... and round the circle we go.

                Whether you call it cloud native or not, hyperscalers are very clear that resilience is the job of the application developer, infrastructure is a commodity and you need to stop thinking in single units and start thinking of scale ..............

                Simplistically, when you only have 100 physical servers in a single physical location, you have A LOT of potential issues that will severely impact the total estate and so are more focussed on eliminating SPOF and protecting yourself. Once you get up to 100,000+ physical servers and multiple geographical locations (which is basically a small AZ for AWS and Azure), you worry less about losing a server or even a rack, and your resilience issue becomes more focussed on how you manage the fleet at scale and stop SW issues, NW etc.

                1. that one in the corner Silver badge

                  Re: 1 NIC. Port

                  If you take my comment and simply remove the redundant PSUs then the description fits, no matter the scale: you have other storage & CPUs that processes can be run on and the job is to keep enough of the right ones of those alive to get the job done. Other parts of your description are impressive in human terms but can be compared to "how it used to be": Big example, moving your processes from one geographical region to another: in human terms, this is awfully impressive, but if you compare it to shuffling from one cabinet to another that is "only" two feet away in the days of wet string instead of multi-gigabit glass then the modern hardware will complete the data transfer faster and more reliably. Similar statements can be made about every part of the system: the "only" difference is that we all can see the massive, massive size of the hyperscaling because us humans have stayed the same size and we'll get puffed out walking from one end of the cable to the next.

                  > infrastructure is a commodity ... think at scale

                  Yes. Again, as computers have grown up, each part has gone from being a carefully curated single instance to a commodity: no longer would we consider saying that the memory bank over there ("the one we've labelled 'Nellie'") is something that needs care and attention with a hot soldering iron and can of contact cleaner - instead we rip out the DDR with a few million Nellies on it and pop in a new one.

                  Everything in computing has followed the same scaling.

                  And each time scaling occurred, we started by making it the responsibility of the application code to take advantage of it. Again, redundant PSUs are pretty much the only place where it was not dropped onto the coder: every other use of system resources was down to the application. Mirroring data over multiple platters? Those rich enough to own three platters had that right there in the application code. Making use of two or more execution units? Started in the application. And you can see all of this still happening if you look at current day tiny systems: microcontrollers still get treated that way (and very sensible that is, too).

                  So what happened as these scaled-up systems became commodities? Do we still expect the application writers to handle all those aspects?

                  Of course not.

                  What is handling them?

                  Some on specialised hardware (RAID controllers, NICs, DDR busses, SSD internals - although plenty of this 'hardware' is still software running on embedded CPUs) and all the rest is handled by - the Operating System!

                  To cut this short, as I have to go out, the application team should NOT be the ones coping with EC2 units dying and shifting workloads. THAT should be being provided as a commodity, on top of the pile of cabinets in the data centre. There should be an Operating System taking all the commonality away from the application level.

                  There ought to be no more reason to discuss the ins and outs of keeping EC2 instances working than there is to worry about how RAM controllers manage to use 8 channels of DDR slots.

                  1. Claptrap314 Silver badge

                    Re: 1 NIC. Port

                    First, I would argue that you've lost track of what you are talking about in that last sentence. NO ONE is talking about the "ins and outs of keeping EC2 instances working" in this context. We're talking about achieving resilience, and, by specification, EC2 instances are not individually resilient.

                    If you don't want to use EC2 instances for your application, no one is requiring that you do.

                    AWS DOES sell higher-level systems where they are responsible for resilience. These services might well run you 100x what the low-level EC2, ECS, networking & monitoring services run. You do you.

  2. dsch

    Brain drain in this case is the mirror image of enshittification, applied on the inside of the service (employees) rather than the outside (users).

    1. elsergiovolador Silver badge

      Peasants should be grateful they have a job.

      1. kmorwath

        I wonder why AI didn't pre-emptively detect the fault and fix it agentically....

        1. TRT Silver badge

          It tried, but it couldn't get access to the database of service connectors and dependencies.

        2. Matt Collins

          But what if it was Agentic AI having a jolly good go at something in the first place? It's only a matter of time...

          1. Elongated Muskrat Silver badge

            The "AI" suggests the IP address in this setting should be 127.0.0.1...

            1. JWLong Silver badge

              But, only if....................

              You're running Win11 with the latest update that seems to have killed the loopback!

        3. ABugNamedJune

          It was too busy preemptively buying plane tickets for Andy Jassy.

    2. Pascal Monett Silver badge

      Well it got applied on the outside in the end.

  3. mostly average
    Mushroom

    Fret not, shareholders

    AI will save us!

    1. Goodwin Sands

      Re: Fret not, shareholders

      >AI will save us!

      Funny you should say that. Here's Musk, this morning, not long after things started falling over.

      https://twitter.com/elonmusk/status/1980221072512635117

      1. DS999 Silver badge

        Re: Fret not, shareholders

        Wait I thought Musk is trying to sell everyone on AI and especially his pet AI company, not admit the reality that we're in a massive bubble that's about to pop.

        1. elsergiovolador Silver badge

          Re: Fret not, shareholders

          It is a massive source of intelligence for his Russian handlers. So cost is actually not important here.

        2. youwish

          Re: Fret not, shareholders

          yea, just HIS!

      2. Bryan W

        Re: Fret not, shareholders

        And then the article image is flagged as AI slop. It just gets worse the longer you look at it...

        1. W.S.Gosset

          Re: Fret not, shareholders

          Photoshop is now AI?

          1. Elongated Muskrat Silver badge

            Re: Fret not, shareholders

            Pretty much, yes. Since the point where they made it able to "just add a tiger here" and added $$$ to the license cost because that feature is very "useful".

      3. that one in the corner Silver badge
        1. Goodwin Sands

          Re: Fret not, shareholders

          Yes that's much better! I've been running my own nitter instance since nitter.net shut down last year so hadn't realised it was back online again.

        2. Jamie Jones Silver badge

          Re: Fret not, shareholders

          Note that nitter doesn't display the community notes that show that what Musk posted is a load of photoshopped bollocks

      4. Anonymous Coward

        Re: Fret not, shareholders

        Probably a lower %tge than bots pushing X posts and supportive it.

      5. John Brown (no body) Silver badge

        Re: Fret not, shareholders

        JavaScript is not available.

        We’ve detected that JavaScript is disabled in this browser. Please enable JavaScript or switch to a supported browser to continue using x.com. You can see a list of supported browsers in our Help Center.

        Oh dear, never mind, I'm almost certainly not missing anything important :-)

    2. Dostoevsky Bronze badge

      Re: Fret not, shareholders

      "Can't have outages if you're never up in the first place!"

      [big_brain_meme.jpg]

  4. elDog Silver badge

    Can't believe I'm the first to suggest that AWS should run its systems with bots and AI.

    Just let all of their hosted AI systems look at Amazon's web services - performance, network diagrams, outages, etc. and I'm sure they'll come up with some fixes. If all the circuits are software patchable then no need for techies plugging/unplugging.

    Then throw in the Amazon financials and personnel organization plans and see if Bezos and every other human can't be sent packing.

    1. Someone Else Silver badge

      Re: Can't believe I'm the first to suggest that AWS should run its systems with bots and AI.

      Can't tell if I'm missing a <sarcasm> tag or not...

      1. Anonymous Coward

        Re: Can't believe I'm the first to suggest that AWS should run its systems with bots and AI.

        If you are Amazon senior management, then no, <sarcasm> is not missing. At least, not about the patching part. Bezos et al are critical infrastructure, of course.

        Otherwise ...

    2. W.S.Gosset

      Re: Can't believe I'm the first to suggest that AWS should run its systems with bots and AI.

      >and see if Bezos and every other human can't be sent packing

      Well... it's Amazon, so shouldn't that be,

      "and see if Bezos and every other human can't be sent packaging"

      1. Anonymous Coward

        Re: Can't believe I'm the first to suggest that AWS should run its systems with bots and AI.

        "Bezos .... be sent packaging"

        In 300mm cubes would be favourite.

      2. Filippo Silver badge

        Re: Can't believe I'm the first to suggest that AWS should run its systems with bots and AI.

        My version would be "and see if Bezos and also every human..."

    3. Jedit Silver badge
      Boffin

      Re: Can't believe I'm the first to suggest that AWS should run its systems with bots and AI.

      Haven't Amazon just been bragging that something like 60% of their code is already AI-generated?

      The article says it's not about the technology being old, rather that the support staff are new. But I think we should consider the strong likelihood that the outage happened because the technology is new and the support staff just aren't there.

    4. stiine Silver badge
      Facepalm

      Re: Can't believe I'm the first to suggest that AWS should run its systems with bots and AI.

      Didn't Bezos retire? In what decade are you living?

  5. Anonymous Coward

    I currently work with AWS on a daily basis (and for the most part it's been an infinitely better experience than working with Azure).

    I've encountered several people who have worked for AWS - and most of them hated the culture and pressure. So much so that even though Amazon have tapped me up to go and work for them on what would probably be close to double my salary, I stay the hell away - I don't want to work in that kind of environment.

    Today doesn't surprise me - it's been coming. They've known about the us-east-1 SPOFs for years and clearly haven't been bothered to fix them. The quality of technical support has been getting worse and worse.

    Hopefully today will serve as a massive wake-up call for AWS. A lot of what they provide is great, and there are clearly still very good teams that work there. But they need to refocus on quality, and they can only do that with the best people. And in order to attract and retain the best, salary alone isn't enough any more for most of us. Daft thing is, if AWS sorted out their culture and working environment, they wouldn't necessarily even need to pay the top salaries, because the opportunity to work on interesting tech that underpins so much of the world's major companies would be a draw in and of itself.

    1. Decay

      "Hopefully today will serve as a massive wake-up call for AWS"

      I wouldn't hold your breath. There will be incident reviews, meetings, assessments, analysis etc., but they will basically boil down to what can we do to stop this from happening again without actually spending any more money. So no, not hiring fresh talent or retaining that talent already in play, no to radical overhaul of process and knowledge. No to remediation of known issues if it involves expenditure. Instead it will be do more with less. Beat the employees harder, enforce more and more diligence and output from fewer and fewer people for the same or less money. Spin it like mad with catchy titles like knowledge sharing, centers of excellence, efficiency improvement initiatives, agile resilience, and continuous operational excellence.

      There’ll be shiny PowerPoint decks about empowering ownership and shifting left, while the remaining engineers are shifting caffeine straight into their bloodstream at 3 a.m.

      Next quarter, they’ll unveil a bold new policy called Focus Fridays which will be promptly filled with mandatory incident retrospectives. Someone will suggest replacing ancient tooling, only to be told, “We’ll revisit that next fiscal year,” which is code for never.

      Then come the internal awards: “Unsung Hero of the Outage” goes to the one poor sod who rebooted the wrong thing but accidentally fixed it.

      HR will roll out a “Resilience Recognition” badge on the intranet. This will be marketed with great fanfare and excitement, showcasing how the company truly values its employees and recognizes their contribution because badges are cheap. Leadership will congratulate themselves for “learning from adversity,” and by the time the next blackout happens, they’ll have a snazzy new dashboard to watch it fail in real time alongside their investment portfolio dashboard that takes up a greater fraction of their attention.

      But don’t worry!!!! There’ll be a T-shirt. “I survived the 2025 AWS outage.” Comes in gray. Just like morale. If it wasn't for the negative impacts on the employees and customers the word Schadenfreude would be very applicable.

      And it's a sad indictment on current management practices and in particular the MBA brigade* that this is all by design, acceptable losses on the altar of profit, albeit short-term profit. Efficiency theatre as far as the eye can see.

      *Yes, the same people who think Jack Welch was a misunderstood visionary rather than the spiritual father of mass layoffs, short-termism, and shareholder-value human sacrifices. The kind who see burnout as a KPI and chaos as a “scaling opportunity.”

      Next they’ll launch a “Transformation Task Force” whose primary transformation will be renaming the same broken process from post-mortem to value realization review. A new acronym, a new logo, and boom, problem solved at a low low cost, honest, the consultants said so. Until the next outage, at which point someone will quote Sun Tzu in Slack.

      1. Michael Hoffmann Silver badge
        Unhappy

        Good lord, are you a former Amazonian, too? Because that sounds like it really was written by an insider. That's exactly how it works.

        I lasted 18 months, then it was out, or therapy. Buddy of mine thought he could last for his stock to vest. And had to quit and needed months of therapy. He's no snowflake. If you knew him and saw this guy just about break down in tears, you'd have thought he had survived North Korea or something.

        1. ChoHag Silver badge

          This was all familiar to me and I've never worked in Amazon. Corporations are all like this. At $ork we're currently at the reimplementing the snazzy new dashboard stage. This one's got AI in it!

          1. Dagg

            Corporations are all like this

            Yep, So IBM!! know the cost of everything and the value of nothing. I've been gone from them over 10 years and still suffer the PTSD. Bastards.

            1. rcw88

              Price of everything and value of nothing, sounds like every company where HR and arseholes with MBAs (try to) run the business.

              Not the actual line management, you know, the ones that know their people, nurture them, develop them, train them, cover their arses when shit happens. Cos in any technology company shit WILL happen,

              it's the first law of enshittification, get rid of the expensive grey hairs, the ones who will work night and day to keep the lights on, then wonder why your cheap gen Z'r who WFH and knows no-one useful cannot fix anything.

              in the race to the bottom everyone loses.

        2. Anonymous Coward

          We knew east-1 was teetering back in 2013 and that continuing the build out was foolhardy. But no we can't have east-3 and east-4 be also located in Chantilly or Herndon!

          I lasted 6 months as one of only twelve responsible for all world wide ops support for a certain other storage service. Granted I might have lasted a little longer had Snowden fallout not caused a big RIF. But it was the most stressful job I've ever held.

        3. Jason Hindle Silver badge
          Pint

          I did recently see what seemed to be a reasonably well-researched vid on YouTube about Amazon's working culture. It essentially boiled down to not encouraging employees to work for them long enough for the stock options to mature. It's on a pretty long list of employers I would avoid (in no small part thanks to El Reg's journalism over the years).

          In the US, they're the FAANG of last resort for engineering talent (and that's saying something, bearing in mind Meta is on the list).

          Icon? Let's just say the water don't taste like what it ought-a here. Cheers.

      2. Anonymous Coward

        How long were you in for?

        1. ladyonafarm

          Time in

          So true....I refer to my time as tours of duty

      3. Anonymous Coward

        The disrupters become the corporate establishment they replaced. The cycle turns again.

      4. Charlie Clark Silver badge

        And if customers don't move then they'll be proven right.

      5. steelpillow Silver badge

        > they need to refocus on quality

        As in re F.O.C.U.S.

      6. Tim13

        All very familiar working for a mega-corp myself

        The new, leaner, presumably less expensive teams lack the institutional knowledge needed to, if not prevent these outages in the first place, significantly reduce the time to detection and recovery.

        HOW COME WE NEVER MET at the IT-dept?

        And you absolutely nail the mumbo-jumbo consultant adviser lingo which managers love so much (think outside the box: but say nothing about how I micro-manage every service aspect).

      7. Antipode77

        And it is all due to the misguided belief, that the ONLY purpose of a Company is making maximum profits for the directors and the Shareholders.

        https://en.wikipedia.org/wiki/Stakeholder_theory

        "The stakeholder theory is a theory of organizational management and business ethics that accounts for multiple constituencies impacted by business entities like employees, suppliers, local communities, creditors, and others. It addresses morals and values in managing an organization, such as those related to corporate social responsibility, market economy, and social contract theory."

    2. Anonymous Coward

      > Daft thing is, if AWS sorted out their culture and working environment,

      Unfortunately, big dumb corporations don't do that.

      Maybe in the Before Times, some companies might have done. Some.

      Nowadays craporate culture dictates that every last cent must be wrung out of people, until they're fully depreciated and written off. Someday they won't be hired in the first place; the big wheels will merely stand up another "AI" instance, savings will be banked, success will be declared, executive bonuses handed out, and what few employees are left will die inside a little more every day.

      It'll be a nightmare, of course -- "AI" is a bad con game on a good day, so the products and services will be fetid garbage, support will either be a twisted game of manipulative customer abuse or simply non-existent, prices will be an escalating swindle. But since every market will be dominated by monopolies, it's either put up with it all or go without.

    3. An_Old_Dog Silver badge

      Work Culture & Corporate Values

      ... originate at the top and roll downhill.

      1. BebopWeBop

        Re: Work Culture & Corporate Values

        I think it generally leaks, a brown sticky, smelly trickle.

      2. Like a badger Silver badge

        Re: Work Culture & Corporate Values

        Always been the case in any organisation, "shadow of the leader" is real.

        And that's why AWS won't change its nasty culture, because it starts at the very top with a former hedge fund manager, it percolates through AWS, and infests the tat-bazaar side of things which enjoys a similarly poor reputation as an employer. Unless the proles are working hard, nose to the grindstone to the point of burn out then Jeff thinks they're parasites. Also, like most imperial bosses he's recruited people who sound like him, think like him, and rarely challenge him, a culture evolves of fearing to break bad news and rejection of new ideas, and suddenly management is an echo chamber of preserving the status quo except when the leader lets rip with a brain fart, in which case it's the finest, most whizzy idea since man worked out how to make fire (eg sending a bunch of women including the head honcho's wife to the top of the atmosphere in a very phallic rocket).

        1. Antipode77

          Re: Work Culture & Corporate Values

          Seems corporations need a lot more Democracy.

          Can't have Emperor Bezos, Zuckerberg, etcetera.

    4. Anonymous Coward

      AWS are literally shitting money as the revenue driver at Amazon … so it’s not like the choice to fund it is hard.

      AWS is heading down the Azure/M365 path of Enshittification and AI.

    5. John Brown (no body) Silver badge

      "Daft thing is, if AWS sorted out their culture and working environment, they wouldn't necessarily even need to pay the top salaries, because the opportunity to work on interesting tech that underpins so much of the world's major companies would be a draw in of itself."

      Yeahbut, someone has to fund Bezos' massive penis rocket.

    6. Anonymous Coward
      Anonymous Coward

      sure Jan.

  6. Dan 55 Silver badge

    As well as brain drain at Amazon, there's brain rot (AI)...

    Universities Are Part of the Cursor Resistance

    So these students were in for some culture shock when they spent the past summer interning at Amazon, where their managers strongly encouraged them to use AI coding tools. When they used Cline, the coding agent of choice for their teams, their managers told them to keep up the good work. When they didn’t use Cline, their managers asked why not.

    One intern recounted bringing errors to his manager to help solve. The manager would copy and paste the code into Cline and instruct the AI to fix the error, instead of fixing the bug manually.

    As a result, the intern said he wrote fewer than 100 lines of code himself over the summer, while Cline wrote thousands. A spokesperson for Amazon said employees are encouraged but not required to use AI tools.

    If they're just copying and pasting and doing what the coding assistant says, eventually they're going to screw up.

    1. disgruntled yank

      Re: As well as brain drain at Amazon, there's brain rot (AI)...

      Cline, as in Patsy? Because I'm crazy for crying/And crazy for trying/And crazy for trusting you? Or because It Falls to Pieces?

    2. Jason Hindle Silver badge

      Re: As well as brain drain at Amazon, there's brain rot (AI)...

      That gets an up-vote from me, and I'm probably one of the bigger users of AI on the team I work on (mainly as an alternative to Google search). Asking students to get AI to write the code for them shows a frightening lack of self awareness on the part of management*.

      * But then again, we are talking the kind of blithering idiots who think AI is a great way to not need junior engineers anymore.

      1. rcw88

        Re: As well as brain drain at Amazon, there's brain rot (AI)...

        But if you don't have junior engineers where's your talent pipeline building a workforce for the future?..

        Oh yeah, we'll just pinch them from somewhere else, they arrive with no knowledge of your culture and customers, then FSCK off in two years, loyalty works both ways. But senior management and HR don't give a shit, because people don't matter anymore; they are just resources to be exploited.

    3. SnailyFresh

      Re: As well as brain drain at Amazon, there's brain rot (AI)...

      > loosening some restrictions on ChatGPT, including allowing it to write erotica for verified adults.

      As opposed to today, where they are all happy writing the most specifically-anatomical, "on page" rape scenes to any squirt with a gmail address. You don't want to know how many tweens sext with LLM chatbots daily.

  7. Blackjack Silver badge

    [Amazon suffers from 69 percent to 81 percent regretted attrition]

    That's a whole lot, that means the working environment is really bad.

    1. Michael Hoffmann Silver badge
      Unhappy

      It is.

      Though it very much depends on what team you manage to land in.

      My first 10 months or so were amazing, heck, there's AWS code with my name on it (no, that was not involved in this outage, thanks for the snark). Then came a re-org. Within months of that, people started leaving in droves. The documented (as per 360s) worst rated manager and the psychopath who brought him on board reworked an entire region in their image as a personal empire.

      That's why they pay/paid so insanely well: golden shackles. If I had lasted I would now be VERY comfortably retired. Or my wife/widow would be, because I'm not ashamed to say that there were suicidal moments at the end. It was that bad.

    2. Nate Amsden Silver badge

      That poisonous environment sometimes follows employees after they leave. Encountered a few situations where ex amazon folks wanted to overhaul companies in the Seattle area, often with disastrous results. I lived in the area 2000 till 2011, one of the reasons for leaving that region was to get away from those folks. By contrast, never had an issue with ex MS folks.

      NYT had a good article about their culture about 7 to 9 years ago. Explained a lot to me about co workers I had back then.

      1. Anonymous Coward
        Anonymous Coward

        I've consulted for MS Research for a decade (so, R&D arm, not corporate, but still) and even though I wasn't an employee, I can tell that the work culture was just fine. Not perfect, obviously, but definitely a place where I could have seen myself working a lifetime without going nuts, if the stars had aligned that way. Whereas all I hear of Amazon is crap piled up on crap.

  8. mikus

    The great Godaddy outage of 2012

    Everyone forgets when Godaddy went down in 2012, much of the internet stopped working. Certainly not because they hosted most of it, but because most used them as a DNS registrar. That day a config glitch sent them spiraling down to take down their entire DNS anycast network globally, causing glue records for some 70M domains to stop working at all. It took a good 12 hours to fix I think with network vendors involved.

    Much the same, too many eggs in one basket, and AWS has very large baskets.

    1. Nate Amsden Silver badge

      Re: The great Godaddy outage of 2012

      Being a dns registrar and providing authoritative dns are quite different. At the end of the day it is the root servers that point you to who is authoritative, not the registrar. (Of course the registrar is responsible for updating the root servers if there are changes but that is pretty rare during a domain's life.)

      Company I was at in 2012 was a GoDaddy customer for registrar, though we used dyn for dns hosting, initially because GoDaddy wouldn't allow TTL of 60s or something. I don't recall any issues with them at any point but it was a long time ago...

      1. Anonymous Coward
        Anonymous Coward

        Re: The great Godaddy outage of 2012

        The point is that once people have registered their domain with GoDaddy (or whoever) they then tend to have their DNS entries left on the GoDaddy system. Because, for the majority of them, why would they think to move them? You say the company you were at used another supplier to host your DNS, for a technical reason that doesn't matter to many of them*. Fine. But that won't be the usual practice. So the result is that GoDaddy will be left as the authoritative DNS that the root servers will pass lookups on to.

        * you mention a very short, for day to day purposes, TTL; could make guesses why you wanted that - still needing the dynamic DNS that dyn started up with?

        1. Nate Amsden Silver badge

          Re: The great Godaddy outage of 2012

          No, at the time in 2011 we did use godaddy but were hosted in Amazon cloud (moved out early 2012), and due to cloud wonkiness we sometimes had to point dns at a different address and wanted a low TTL for that. I don't recall specifically why now.

          I had previously used dyn at an earlier company (again did not use dynamic dns, just their enterprise stuff and geo load balancing). So when management asked about the TTL I suggested dyn, and we stuck with them till the end (eventually migrated to OCI DNS which was dyn on the backend but I left the company not too long after that).

          When we moved out of cloud the TTL thing was no longer an issue.

          I do miss dyn though, the real dyn. Specifically the ability to update dns records then later be able to publish them in bulk (reviewing changes during publish), rather than update one entry at a time in real time. It wasn't cheap though, OCI DNS was about 99.5% cheaper. Company struggled with costs as time went on but for whatever reason nobody ever questioned the DNS bill nor the Internet provider bill.

          Current org uses GoDaddy as well but uses dns made easy as their dns provider. (Both established before I joined.) It works fine, just not fond of the dns update process compared to dyn. (I only used OCI for a tiny bit, I don't remember the interface.)

  9. The Original Steve

    I did initially think that essentially things like IAM will ultimately be tied to a single region, and just like on-prem if AD DS goes down then you're screwed.

    But then again I thought that AD DS doesn't really go down due to a server or datacentre failure due to it being multi-master. Why isn't AWS IAM the same?

    DNS is also multi-master, so combined they shouldn't fail due to an issue in a single region. I can understand issues occurring, but we've hammered these out on-prem for decades - surely the big cloud providers can surpass this?

  10. elsergiovolador Silver badge

    Acceptance

    Why would anyone care today? Everyone and their dog is on AWS, so they created a sense of "we are in this together".

    Site goes down, employees go to park, maybe they start the beers early.

    Everyone has accepted that the most important thing is the wealth of shareholders.

    When even governments accepted this?

    Just internalise that there is no future in engineering anymore. Ladders have been pulled.

  11. Nate Amsden Silver badge

    last dns outage I had

    Was about 5 years ago due to this bug

    https://kb.isc.org/docs/aa-01315

    Which caused several brief dns outages lasting a few seconds apiece randomly over several months (with all 8 of my recursive resolvers freezing up at the same time, always auto recovered). Super difficult to trace the cause. Was annoying that someone had already reported it to Ubuntu but they did not fix it for some time.

    Before that the last dns outage I can recall was in 2016

    https://en.m.wikipedia.org/wiki/DDoS_attacks_on_Dyn

    I remember it was one of the few times I was on the data center floor working on things during this attack. People calling me saying our sites were down when they were not, took a couple of minutes to determine it was dyn that was having issues..

    Not sure what causes others to have outages related to dns, it's a pretty simple service.

    1. StewartWhite Silver badge
      Boffin

      Re: last dns outage I had

      "Not sure what causes others to have outages related to dns, it's a pretty simple service."

      You need to get with the programme! Simplicity is the enemy and must be prevented at source otherwise people would start to realise that "The smartest guys in the room" are just jumped up MBAs and "vibe" coders who couldn't manage their way out of a carrier bag.

      1. PinchOfSalt

        Re: last dns outage I had

        What's with the thing against MBAs?

        I see it all the time, yet there's never any explanation why this specific qualification is problematic.

        Tech vs marketing just seems a low-brow argument. As if tech had any purpose without marketing and marketing (now) without tech.

        I do not have an MBA, but have listened to those that do and do not find what they say offensive or worthy of ridicule.

        1. Jellied Eel Silver badge

          Re: last dns outage I had

          What's with the thing against MBAs?

          I see it all the time, yet there's never any explanation why this specific qualification is problematic.

          We're IT people & engineers. MBAs generally aren't. Many of us have worked with enough MBAs to know that they can be useless and dangerous. Of course this is a sweeping generalisation, and I have worked with some good ones. The good ones tended to have been ones who've got experience, then got an MBA. Ones that have been more.. challenging are those that went straight from grad school with their fresh MBA, and then think they understand business.

          1. Jason Hindle Silver badge

            Re: last dns outage I had

            That used to be much of our cadre of management and project management. People who knew how to rack and cable servers, and understood the products, who went on to study for an MBA.

          2. cassandratoday

            Re: last dns outage I had

            The problem is people who try to do things they're not qualified for. You don't need an MBA to be that kind of person, and getting an MBA won't turn you into that kind of person.

        2. StewartWhite Silver badge

          Re: last dns outage I had

          Have a look at this article from Forbes (ironically the magazine of choice of the vainglorious MBA) explaining why MBAs think that they're above us sans-culottes: https://www.forbes.com/sites/ericjackson/2012/08/26/the-ten-most-dangerous-things-business-schools-teach-mbas/

          If you've ever listened to an MBA they'll have marked you down as an inferior simply for not having an MBA. I had the misfortune of working for Accenture for a few months and they pumped the management consultant MBAs full of the idea that at the age of 25 they knew way more than everybody else and could apply the Accenture playbook (literally a vast paper manual at the time) to any situation and it would work. It mostly seemed to recommend sacking lots of people (the older the better), cutting corners and removing costs as far as I could see. It was hardly revolutionary; just bog standard asset stripping dressed up - all fur coat and no knickers.

          For a good read on how Enron employed vast swathes of MBAs but went spectacularly and criminally bust try "The Smartest Guys in the Room" by Bethany McLean and Peter Elkind.

          1. Jellied Eel Silver badge

            Re: last dns outage I had

            If you've ever listened to an MBA they'll have marked you down as an inferior simply for not having an MBA.

            Also might need to keep a business dictionary handy to try to understand their wafflebollocks and manglement-speak. They can have an unfortunate habit of using their own MBA language, and lots of it. Which might be down to the MBA industry largely being designed around selling management books, and management consultancy. So the more meetings called and the more words used, the more billable hours.

            This can sometimes be entertaining though. Like telling consultants that their proposals need to be no more than 3 pages or slides, What? Why? and How Much? This is also where good MBAs can sometimes be useful, ie if manglement won't listen to their own underlings, get the MBA to present your ideas. Which might also be part of the cult thing, so manglement will accept the idea because it's been presented by a fellow cult member. Which can sometimes be a painful experience seeing expensive external consultants presenting your own ideas, diagrams and documents as their own work. And in one case, then threatening to sue for breach of copyright even though they were using our designs.

            1. Roland6 Silver badge

              Re: last dns outage I had

              >” Like telling consultants that their proposals need to be no more than 3 pages or slides”

              My favourite deliverable was a diagram I wrote on a blank sheet of paper as I talked my project sponsor through their problem and the required solution.

              The light bulb went on in the sponsor’s head. They took the piece of paper and told me to bill for the full agreed cost of the assignment and no they didn’t want a report.

              They then repeated the presentation to an internal meeting of their worldwide IT leaders, who in turn had light bulbs go on. My canny project sponsor got what he wanted, his IT team now owned the problem and its solution and thus actively worked to make it happen; yes I did play a major part in the solution delivery as the delivery partner not a supplier; so it was worth keeping quiet and playing the game.

              Interestingly, much of my career has been this style of engagement, so little to actually point to and say I did this, but plenty of sealed brown envelopes in the garage if you really need evidence…

            2. Anonymous Coward
              Anonymous Coward

              Re: last dns outage I had

              ahh Accenture consultants re-presenting ideas previously raised and rejected by staff and watching management fawn over them and welcome their incredible insight. takes me back. I saw this happen so many times. We had a saying for their technique. "Steal your watch then charge to tell you the time"

          2. CrazyOldCatMan Silver badge

            Re: last dns outage I had

            all fur coat and no knickers.

            I thought that was a good thing?

        3. Decay

          Re: last dns outage I had

          "What's with the thing against MBAs?

          I see it all the time, yet there's never any explanation why this specific qualification is problematic."

          As some others have said not all MBAs are bad, I would argue that experienced industry people with good knowledge and ability in their chosen field who move on to MBAs are usually quite good. These are rare unicorns.

          Some light reading for you

          https://evonomics.com/want-to-kill-your-economy-have-mba-programs/

          https://www.advisorpedia.com/viewpoints/mba-thinking-can-ruin-businesses/

          https://rpc.cfainstitute.org/research/cfa-digest/2014/10/you-reap-what-you-sow-how-mba-programs-undermine-ethics-digest-summary

          https://hbr.org/2005/05/how-business-schools-lost-their-way

          https://www.theguardian.com/news/2018/apr/27/bulldoze-the-business-school

          Ok too much reading?

          Toys "R" Us

          How management dropped the ball

          Heavy debt burden: After a leveraged buy-out in 2005, the company carried large debt, which limited its flexibility.

          Slow adaptation & innovation: TRU failed to keep up with the shift to online retail, changing consumer behaviour, and competition from big-box and e-commerce players.

          Focus on cost/finance instead of investment: Because of the financial burden and perhaps a bias towards cost metrics, the company under-invested in store experience, technology, and e-commerce capabilities.

          Strategic complacency: The business model (“toy supermarket”) became obsolete, yet the management did not pivot fast enough.

          MBA fingerprints all over it

          Short-term/financial driven: The debt-heavy structure and focus on servicing it limited long-term investment.

          Metrics over context: TRU arguably prioritized profitability, cost structure and upkeep of the model, rather than deeply rethinking the changing retail/consumer context.

          Tool/structure focus rather than adaptability: Despite being in a dynamic market, the company did not sufficiently embrace emergent strategies or culture change.

          The takeaway?

          Even strong brands can be undermined if management emphasizes financial engineering, cost metrics and static business models rather than continuous customer-/environment-led adaptation.

          Boeing 787 Dreamliner

          How management dropped the ball

          Extensive outsourcing: Boeing outsourced a large portion of manufacturing and global supply chain tasks (approx. 70% of work) expecting cost savings and efficiency.

          Complexity and control issues: The global network of suppliers resulted in loss of direct control, inadequate supply-chain visibility, quality issues, delays.

          Cost/efficiency metric focus: The drive to reduce internal manufacturing and transfer risk upstream may reflect a financially-driven management mindset (outsourcing to reduce capital cost) rather than builder/innovation-centric.

          Cultural/operational misalignment: Employees and insiders reported that quality was sidelined in favor of schedule/throughput.

          MBA fingerprints all over it

          “Manage what you can measure”: Manufacturing cost reduction, outsourcing metrics, schedule targets were emphasised; less visible were human/quality/craft/learning metrics.

          One-size tool/structure: A standard outsourcing model applied in a context (high-technology aerospace) where emergent adaptation and oversight were critical.

          Short-term/financial prioritization: The urgency to deliver and cut cost may have trumped investment in deeper quality and supply‐chain robustness.

          The takeaway?

          In highly complex, innovation-intensive fields, management strategies that focus too narrowly on cost and metrics (outsourcing, schedule) can produce systemic risk. Robustness and human/quality factors must be embedded from the start.

          Enron

          How management dropped the ball

          Financialization and short‐term metrics: Enron’s compensation and culture were extremely focused on meeting earnings targets, boosting stock price, special‐purpose structures and mark-to-market accounting to show growth.

          Ethics/complexity neglect: The financial engineering, opaque off-balance-sheet entities and risk‐taking were demonstrations of tool/metric-based management divorced from long-term value and stakeholder risk.

          Oversight failure / culture of measurement: The emphasis on financial benchmarks, growth illusions and internal incentives created a culture where the outcome of the tool-driven management was disastrous.

          MBA fingerprints all over it

          The obsession with shareholder value and financial metrics at the expense of broader accountability.

          A reliance on management models (mark-to-market, SPVs, bonuses) that look good on paper but neglected substance, ethics and stakeholders.

          The takeaway?

          When management treats metrics and financial engineering as ends in themselves rather than means to serve business purpose, the result can be catastrophic. The “toolbox” enables but does not guarantee good outcomes.

        4. Anonymous Coward
          Anonymous Coward

          Re: last dns outage I had - MBAs

          I'd like to know this too.

          I'm AC as I have an MBA. Yup, really. But I got mine after working up from an operator to a senior mainframe System Programmer, and then as SysProg Manager.

          After completing it I went in to IT commercials, account management and finance.

          But company changes meant a change of career and it was back to being a techie.

          1. SnailyFresh

            Re: last dns outage I had - MBAs

            Not to No True Scotsman you, but the MBAs who have real world experience to season the MBA disease of focusing on metrics ahead of everything, don't trouble the world with crazy ideas

        5. Anonymous Coward
          Anonymous Coward

          Re: last dns outage I had

          The key point appears to be whether or not they make damn sure you know they have an MBA before they overrule you.

          Instead of actually using their knowledge and giving you at least a brief rundown of how you are presenting something that is good in all the techie bits but here is why it is not the best course when taking into account these bits of the business as a whole.

    2. Jellied Eel Silver badge

      Re: last dns outage I had

      Not sure what causes others to have outages related to dns, it's a pretty simple service.

      Well, there was that time when a Livingston Portmaster connected to the consoles of a bunch of Sun servers got powercycled on the instructions of the sysadmins. Who decided they really didn't need to be in a London datacentre at 2am on a Sunday. Even though powercycling sent a break to all the servers, which put them into console mode and halted them. Which included a server with the cuddly name of X.root-servers.net

      Fun times, but losing the root-server was mostly just an embarrassment given there were others.

      But IMHO, a problem with DNS problems is, well, DNS and the way people are conditioned to connect via URI, which is fine, except DNS is the one ring that binds them. So then there's some simple stuff like how to connect to a server when it is unreachable by name. And then IPv6 can make that more FUN! when the address isn't as simple as 8.8.8.8. But that's why I like having out-of-band access to critical components, because if (when) there's a DNS or BGP oopsie, it makes fixing it a whole lot simpler.
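The "unreachable by name" problem can be illustrated with a small break-glass lookup: try DNS first, and fall back to a locally held address table if resolution fails. This is only a minimal sketch; the function name, the table, and the hostnames and addresses in it are hypothetical (the addresses come from reserved documentation ranges), and a real out-of-band setup would also need a separate network path.

```python
import socket
from typing import Callable, Mapping

# Hypothetical break-glass table kept locally, outside DNS.
# 192.0.2.0/24 and 2001:db8::/32 are reserved documentation ranges.
BREAK_GLASS_ADDRESSES = {
    "console.example.net": "192.0.2.10",
    "router1.example.net": "2001:db8::1",
}


def resolve_with_fallback(
    host: str,
    fallback: Mapping[str, str] = BREAK_GLASS_ADDRESSES,
    resolver: Callable[[str], str] = socket.gethostbyname,
) -> str:
    """Return an address for host, preferring DNS but surviving its failure."""
    try:
        return resolver(host)
    except OSError:
        # socket.gaierror (name resolution failure) is a subclass of OSError.
        if host in fallback:
            return fallback[host]
        raise  # no static entry either, so surface the original error
```

The resolver is injectable so the failure path can be exercised without actually breaking DNS, which is arguably the part most worth rehearsing before the real oopsie.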

  12. xyz123 Silver badge

    Amazon literally tells its engineers not to step outside their assigned role or be fired.

    So they can't suggest a fix even if they know 100% what it will be. That's exactly what happened this time. EIGHT different staff members pointed to the underlying cause and were told (some literally) to "shut the f*ck up and get back to your job"

    1. SarahC_

      How is that even a thing? Shut up and get on with your job......... sounds like dictatorship for no reason!?

      1. Anonymous Coward
        Anonymous Coward

        https://www.forbes.com/sites/katherinehamilton/2023/05/24/delivery-drivers-sue-amazon-for-being-forced-to-pee-in-bottles/

        https://www.bbc.co.uk/news/world-us-canada-56628745.amp

        Hopefully it’s better than for the delivery drivers many of whom are sub-contracted. I hope there aren’t bottles of piss in the US-East-1 DC to be kicked over.

        Victorian Workhouse - Dickens would have a field day today with Amazon, Apple/Starbucks union subterfuge and fucking country and staff over to save $1 by sending work to China or India.

      2. PinchOfSalt

        There is a reason, just not one we like.

        They are built like a machine. Each cog has its place and should not move from that place without specific instruction. Thinking is dangerous in such environment for it might break the machine and lead to chaos.

        It's effective but not enjoyable. Nor does it allow you to explore and find what you're really good at since you can't try things.

        It's what we used to call running the business by the 'idiot principle', where the supreme leader maintains themselves as the most clever person in the business, and treats all others as idiots.

        1. Antipode77

          Malicious compliance comes to mind.

    2. youwish

      got proof?

      1. Anonymous Coward
        Anonymous Coward

        No, but I got milk

        1. CrazyOldCatMan Silver badge

          No, but I got milk

          But no bananas!

    3. CrazyOldCatMan Silver badge

      Amazon literally tells its engineers not to step outside their assigned role or be fired

      Having worked for a few big US companies, this is *very* much how they operate - almost like the US military (which is why they have a high number of backend staff per combat soldier.. perish the thought that a Rifleman MOS should put their own tent up!)

  13. Doctor Syntax Silver badge

    "It's just a matter of which understaffed team trips over which edge case first, because the chickens are coming home to roost."

    I do love a mixed metaphor. And yes, he's right. Kudos for getting in his rebuttals before the handwaving starts.

    One reservation I've had about Amazon over the years, just as a shop customer, was just how thin the programming coverage was. Yes, they did things at an enormous scale but if the man in the van failed to make a delivery - or even part of a delivery it was quite unpredictable how things would pan out. An item went into a warehouse and didn't leave when it was due (probably nicked but possibly just in the wrong shelf) and nobody ever noticed. An item wasn't delivered to a locker in Yorkshire and next allegedly heard of in France while a courier turned up at the door to collect the return of the undelivered object (and wasn't surprised to be told there wasn't a return).

    The question is, if this is the error handling in what must be at least one of the world's biggest logistics operations, is what's really under the covers of AWS that much better?

    1. breakfast Silver badge

      The more I've dealt with Amazon's other tools, the less I trust AWS. Hard to believe they have incredible engineers in one section while teams elsewhere are so inept I wouldn't trust them with blunt scissors, never mind server infrastructure.

    2. Like a badger Silver badge

      "must be at least one of the world's biggest logistics operations,"

      According to Armstrong & Associates (who claim to be experts in third party logistics) Amazon are not only the world's largest third party logistics company, they are a staggering five times the size of the next largest player, DHL. Unfortunately it looks as though treating your employees like shit is a successful strategy for market domination.

    3. StewartWhite Silver badge
      Joke

      Mixing Metaphors

      My personal favourite mixed metaphors are:

      "Let's put it in cold storage on the back burner"

      "He put his arse on the line but it blew up in his face"

      "Don't count all your chickens in one basket until the eggs come home to roost"

      I'm here all week you know.

      1. breakfast Silver badge

        Re: Mixing Metaphors

        You'll be here longer if you keep waiting for those eggs to come home.

      2. MachDiamond Silver badge

        Re: Mixing Metaphors

        "I'm here all week you know."

        Is the veal any good?

        1. CrazyOldCatMan Silver badge

          Re: Mixing Metaphors

          Is the veal any good?

          *Shudder*

          I wouldn't - it's been the special for two weeks and we haven't sold any yet..

  14. Fara82Light

    Linkage

    Is there ANY evidence that directly links the outage to recent layoffs?

    1. TRT Silver badge

      Re: Linkage

      If there was it would be evidence of a direct and deliberate action of a disgruntled employee, i.e. well-poisoning. How could there be evidence in absentia that the situation could have been avoided, contained or more immediately fixed? The ONLY way that could happen is if the eventual fix came from phoning up or getting tipped off by a former employee.

    2. Fara82Light

      Re: Linkage

      I'll take the number of downvotes as a firm "no".

      1. JoeCool Silver badge

        Re: Linkage

        That Elizabethan cone is for your dog, not your IQ.

        The downvotes are for a silly and irrelevant question.

        There is direct evidence, in the AWS HR files.

        But who is going to assemble a report showing their corporate directives are annihilating their future ability to generate profits?

        That's why you have journalists linking the dots.

  15. GuldenNL

    Multi Region for a Reason

    Took a call from the CEO of a consulting firm in Europe who I provided some assistance to a couple of years ago due to the customer industry in the USA.

    I demanded that this customer's CTO sign a disclaimer that they were offered a multiple AWS region solution and they turned it down. I also was loud about going with West if they refused to go with multiregion. The consulting firm was nervous because I was assertive about it all, without being nasty or putting anyone down. Yes, I was the "know it all" but that's why I was there.

    Apparently someone at the customer called the account mgr today and accused them of not warning the customer of a single region. No idea who, nor if they were there at the time I made a stink, but likely not. Acct Mgr calls Project Mgr while customer is still on the line, PM shoots the doc over, Acct Mgr provides it with glee. Immediately informs the C-Level execs that this saved them and recommended using it every time.

    This know it all can't understand why large enterprises do not go multiregion, given they need Five Nines uptime. The cost just isn't enough to rule it out.
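For scale, "Five Nines" leaves a little over five minutes of downtime a year. A rough sketch of the arithmetic follows, assuming a 365.25-day year and, optimistically, that regions fail independently of each other; the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_budget_minutes(availability: float) -> float:
    """Yearly downtime allowed at a given availability (0.99999 = five nines)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(single_region: float, regions: int) -> float:
    """Availability of N regions where any one surviving region is enough.

    Assumes failures are fully independent, which is the best case;
    shared dependencies between regions make the real figure lower.
    """
    return 1.0 - (1.0 - single_region) ** regions


if __name__ == "__main__":
    for label, a in [("three nines", 0.999), ("five nines", 0.99999)]:
        print(f"{label}: {downtime_budget_minutes(a):.2f} min/year")
    # Two independent three-nines regions reach roughly six nines on paper.
    print(f"2 x three nines: {combined_availability(0.999, 2):.6f}")
```

On these assumptions a single three-nines region allows about 526 minutes a year, so five nines from one region is a tall order, and the multi-region figure only holds if nothing is shared between the regions.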

    1. Kevin McMurtrie Silver badge

      Re: Multi Region for a Reason

      Multi-region is hard and it's expensive at AWS. Software has to deal with high latency, bandwidth fees, and multi-region consistency. Also, us-east-1 seems to host some AWS internal systems, so being in a different region doesn't mean everything keeps working.

      1. Doctor Syntax Silver badge

        Re: Multi Region for a Reason

        If we just take that a little further - multi-region seems to be either too hard or too expensive for AWS. The underlying problem here seems to be that the internal systems weren't multi-region so us-east-1 became a SPoF.

        Justin Garrison in the post linked in TFA suggested that before the brain-drain AWS core was staffed by some very good engineers. If they were good why did they end up with a SPoF? Surely they were bright enough to realise that. My guess is that Amazon manglement wouldn't spend money on "what if this goes wrong" just as they wouldn't in the S/W that underlies their logistics.

        1. Jellied Eel Silver badge

          Re: Multi Region for a Reason

          If they were good why did they end up with a SPoF?

          My guess is a combination of managing growth/complexity, cost and maybe siloing or forcing techies to stay in their lane. So maybe an assumption that there was enough resiliency within the region, ie multiple DCs so it shouldn't become a SPoF. But one of those things we'll probably never know, especially if there were 'reasons' why authentication services were located conveniently for Virginia and Maryland.

      2. Fred Daggy
        Angel

        Re: Multi Region for a Reason

        So ... you're saying ... Infrastructure is hard. Oh my Flying Spaghetti Monster ...

        It's almost like you're saying ... (1) you need well trained, loyal staff to do it well and (2) it's going to cost you some actual $CURRENCY to do it well but, (3) it will cost you a lot less $CURRENCY than ballsing it up.

        Or just outsource it to a Low Cost Country and have them run a server out of their bedroom. And close your eyes.

        1. This post has been deleted by its author

  16. LemonTree3

    Powerful writing

    I rarely comment, but wow, this was probably the most memorable line I've ever read in The Register! "Remember, there was a time when Amazon's "Frugality" leadership principle meant doing more with less, not doing everything with basically nothing. " Well done!

  17. Anonymous Coward
    Anonymous Coward

    ... increasingly demonstrates apparent disdain for your expertise.

    company nation that increasingly demonstrates apparent disdain for your expertise.

    A community service message brought to you by the former NIH and CDC.

    1. Anonymous Coward
      Anonymous Coward

      Re: ... increasingly demonstrates apparent disdain for your expertise.

      They've had years to remove marijuana from the Schedule 1 list, and they haven't. It's not a lack-of-employees problem.

  18. ComputerSays_noAbsolutelyNo Silver badge

    The good thing with vendor lock-in ...

    is that the customers, or should one say the hostages, can not leave, if you stuff it up.

    1. Elongated Muskrat Silver badge

      Re: The good thing with vendor lock-in ...

      This is possibly why vendor-neutral architectures, and shoving everything and their mother into a container, are becoming a big thing.

  19. roselan

    meanwhile

    AMZN is up 1.61%

    The market doesn't see this outage as an issue, if it sees it at all. AMZN is not AWS, but still.

    This will happen again.

  20. flayman Bronze badge

    Shared responsibility model

    You shouldn't have all your eggs in North Virginia. What the hell? Whatever happened to multi-region, multi-AZ, high availability, fault tolerance? A DNS resolution failure in US-EAST-1 should not bring down half the internet for half a day.

    1. Doctor Syntax Silver badge

      Re: Shared responsibility model

      Because having spent the money to get it working there they weren't going to spend more money to replicate the service in each region let alone even more money to let one region fall back to another if needed. Even from the outside it's plain to see that Amazon aren't interested in anyone asking "what if?" let alone putting in the code needed to handle the answers.

  21. TRT Silver badge

    Very disproportionate...

    I'd have expected a lot worse TBH. It was disproportionately limited!

  22. Seattle-Jeff

    Sh*t happens in large/complex systems

    My wife has been an engineer at AWS for 10 years. She still has problems grasping the enormity of what the AWS cloud is. When systems get this large and this complex, problems will happen, and yes, the industry will forgive them and move on. Lessons will be learned, as they are in every outage, but the insinuation that all the experts have left AWS is absolutely absurd.

    1. Anonymous Coward
      Anonymous Coward

      Re: Sh*t happens in large/complex systems

      Given the level of attrition I suspect she hardly knows her (current) manager, none of the experts who were in her first team are still employed by Amazon and her internal network has been shot to shreds, so if she has a problem, finding someone to assist is a major undertaking and exercise in frustration as she tries to locate the relevant person and achieve her objective in the timescales her responsibilities require.

      This was the point when I lifted my head above the parapet and had a look around and decided it was time to leave, taking the voluntary redundancy package in the full knowledge that few in the company had any real idea what it was I contributed, thus my parting gift was an empty seat when someone came looking, as I knew they would…

      1. youwish

        Re: Sh*t happens in large/complex systems

        how long ago did you leave AWS?

      2. CorwinX Silver badge

        Re: Sh*t happens in large/complex systems

        I had a similar thing with a major bank I worked for.

        I was tech-lead on a major overhaul of their global email system.

        Saw the way the wind was blowing, after five years, and decided to take a (very generous) redundancy package.

        Then the phone calls started...

    2. Anonymous Coward
      Anonymous Coward

      Re: Sh*t happens in large/complex systems

      There are smart people, and there are people who stay at Amazon.

      Not sure you really want to talk about your wife like that.

  23. anotherblowhard

    The sad reality is there is no (real) cost to Amazon or Microsoft or any of these behemoths

    Your services were down for a day, it affected your company, you couldn't sell, buy, market yourself... who cares.... certainly no one at Amazon, Microsoft or Google.

    They will still get their monthly cut, they will appease their shareholders with some marketing speak and it will all be forgotten in a week.

    And to be fair, you might have lost a few customers, but probably on balance others were just delayed not lost, I mean where are they going to go, can't withdraw my money from another bank if it is in your vault can I?

    The real concern is when (not if), the bigger, longer meltdown occurs that can't be fixed in a few hours - either because of politics, cut cables, AI slop in charge or as you suggest, no one left who knows what the hell to do. Then how well will your company do.

    Time to back up the cloud on local servers I think, at the very least it is much harder to take out thousands of companies all at once when they are physically distributed and independently configured and operated.

    1. Anonymous Coward
      Anonymous Coward

      Re: The sad reality is there is no (real) cost to Amazon or Microsoft or any of these behemoths

      They will still make 99.999% uptime.

      1. Michael Strorm Silver badge

        Re: The sad reality is there is no (real) cost to Amazon or Microsoft or any of these behemoths

        One day every 273 years?
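
        The arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch of the usual "nines" calculation; the function names are made up for illustration, and a year is assumed to be 365.25 days.

```python
# Downtime budget implied by an availability target ("nines").
# Assumption: a year is taken as 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def years_per_full_day_outage(availability_percent: float) -> float:
    """Return how many years of downtime budget one 24-hour outage uses up."""
    return (24 * 60) / downtime_minutes_per_year(availability_percent)


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(
            f"{target}%: {downtime_minutes_per_year(target):.1f} min/year, "
            f"one full day every {years_per_full_day_outage(target):.1f} years"
        )
```

        At five nines the budget works out to a little over five minutes a year, so a single day-long outage eats roughly 273 years of it.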

    2. Anonymous Coward
      Anonymous Coward

      Re: The sad reality is there is no (real) cost to Amazon or Microsoft or any of these behemoths

      I have a client that had a guy decide Dropbox was a good place to store their files.

      When Dropbox had a screwup and lost a bunch of their stuff for about a week, I pointed out that I had a local backup copy of everything, a backup of that backup, and I could make it a network share in about 5 minutes. I got a shocked look, was told no, they'd wait for Dropbox. Once Dropbox got the recovery completed, the guy who decided Dropbox was a good idea told me to delete the local backup, and make a backup on AWS.

      Realizing that this guy was never gonna learn his lesson, I proceeded to slow-walk the AWS backup until he forgot he told me to do it, then I shut down the AWS account. I never deleted the local backups, just completely ignored that, even went behind his back to quietly upgrade the servers. A few months later when that guy left, I told his former partner what I'd done. Got a "Well I'm glad you ignored him, thank you for looking out for us!"

      There is no cloud, only somebody else's computer.

    3. Elongated Muskrat Silver badge

      Re: The sad reality is there is no (real) cost to Amazon or Microsoft or any of these behemoths

      The longer outage will probably come as the result of infiltration by a hacking group, or more likely nation-state actor, who will probably have either found a zero-day exploit, or planted a deliberate exploit into custom hardware. It's probably already there, waiting. According to Bruce Schneier, "AI" is already ahead of the game in finding new exploits.

  24. Anonymous Coward
    Anonymous Coward

    Amazon deserves to fail

    As a current employee at AWS, I can confirm much of what this article asserts: Amazon treats us like garbage.

    Rather than fostering its employees, it actively burns through people, repeatedly making terrible decisions which drive employees to leave the company. Its much-vaunted "Leadership Principles" reflect only the sociopathology of Bezos and Jassy, who have built a continuously hostile work environment that constantly reminds employees of their worthlessness, burning people out at an alarming rate.

    In my time at the company, I have seen so many brilliant engineers leave the company either as victims of layoffs for the sake of "Frugality" or because they can no longer tolerate the hateful culture fostered by Bezos and Jassy. Amazon is a terrible employer. They deserve all of their failures.

    If you are always taking a dump on your employees, eventually you will have to lie in the mess you have made.

  25. Anonymous Coward
    Anonymous Coward

    As a 59 year old whose applications never gain even a reply

    Despite 40 years of experience and a name in two RFCs ....

    fuck 'em.

  26. Paul 14
    Mushroom

    "the market will forgive AWS this time"

    Maybe in the US - but I got really strong vibes from my network in the UK and Europe yesterday that senior people are VERY unhappy that their core business depends on infrastructure on another continent working correctly. Yesterday workloads in other AWS regions - that AWS makes a big song and dance of supposedly being isolated - were impacted by the US-East-1 issue.

    Unless AWS makes serious moves to solve the US-East-1 dependency problem, they will lose customers outside the US in droves. Execs and governments care about digital sovereignty. They want to know that they'll still be able to operate if Putin shreds transatlantic cables - AWS demonstrated yesterday that they're not up to that challenge, but worst of all, they eviscerated trust among senior people who believed the region-independence narrative.

    1. Anonymous Coward
      Anonymous Coward

      You need to be careful not to conflate poor user architectural choices with AWS issues. AWS lost two core global services yesterday, at least the control planes anyway (IAM was happily performing authN/authZ in other regions). That does not mean "AWS was down" as the press have been saying.

      The very large customer I'm working with right now experienced a tiny impact (we couldn't perform mutating operations on CloudFront, as the control plane is in us-east-1), however all services (hosted in a European region) continued to run very happily throughout the day.

      I'm horrified that some of the services which were impacted yesterday are reliant on us-east-1 availability, the *only* excuse I can see is if DDB global tables is core to your architecture. Everything AWS publishes stresses multi-AZ architecture, if you chose to disregard that for cost, complexity or latency reasons then you should be conscious of the risks.

      1. Anonymous Coward
        Anonymous Coward

        AWS IS a poor architectural choice. Putting your critical stuff on somebody else's computer is ALWAYS a poor architectural choice.

  27. Taliesinawen
    1. VicMortimer Silver badge
      FAIL

      Re: TLDR ..

      Don't link to Xitter. Lots of us can't see it, because we've got that malware site blocked.

  28. DaemonProcess

    how did this come to be?

    I can imagine one scenario - "why don't we put our DNS in DynamoDB because it's faster. We can also combine it with IAM and certificates so we can get all the info we need in one operation!"

    Works great until a bad update of the API DNS record stops anybody else from querying or correcting the error because it's self-dependent.
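
    To make the self-dependency idea concrete, here is a minimal Python sketch that looks for a cycle in a service dependency graph. The graph is entirely hypothetical and is not a description of how AWS is actually wired; the service names and the find_cycle helper are invented for illustration.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we have looped back to it
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for service in deps:
        if state.get(service, WHITE) == WHITE:
            cycle = visit(service)
            if cycle:
                return cycle
    return None


# Hypothetical graph: DNS records live in a database whose API endpoint
# is itself found through DNS.
deps = {
    "dns": ["database"],
    "database": ["api-endpoint"],
    "api-endpoint": ["dns"],
    "iam": ["database"],
}

if __name__ == "__main__":
    print(find_cycle(deps))
```

    Any cycle a check like this reports is a place where a bad update could lock out the very tooling needed to repair it.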

  29. david1024

    This is how the game is played

    Supposing everything in the article is true, Amazon is training the next set of 'old heads' on their customers' dime... They have offloaded that expense. They will lose negligible numbers of customers. This is the benefit of being the leader/default option, and they are taking that money. And you'll pay--same as me while they train folks and when they get expensive, it'll be time for another round. :(

  30. Cruachan Silver badge

    Going to happen more and more as the big players gather more market share, until the bubble bursts and I spend the next 15 years undoing all the on-prem to cloud migrations I've done over the previous 15 (assuming of course that it's even an option to do so given that Microsoft, Atlassian and many others don't offer certain products on-premises anymore).

    1. Elongated Muskrat Silver badge

      "Based on someone else's proprietary product" is the new "someone else's computer". The future is building things on open architectures, so you can take that thing that currently runs on Azure and pop it onto a set of machines in your own data centre, running Linux or a similarly non-proprietary OS, if your "thing" even cares what OS it's running on, inside its container.

      The problems come when the big players borg the open source products and projects that just work, like GitHub, and slowly enshittify them.

  31. skelly28

    Sometimes it's BGP

  32. Swb

    Can't these DNS problems be solved by AI?

  33. youwish

    You're posting a lot of generalizations here. Obviously you don't know AWS or anyone that works there.

  34. CorwinX Silver badge

    Started work at an investment firm long ago

    Their previous IT guy had left very suddenly.

    First day (hour in fact).

    Half the PCs could connect to the net, others couldn't.

    Checked the network settings, set by DHCP, on working/not working.

    Completely different settings. Conclusion - there's two DHCP servers on the network.

    Walked into the server room, for the first time, and there's racks of gleaming HP kit and... a black mini-tower sitting on a rack shelf.

    One of these things is not like the others says I (Sesame Street ref).

    No-one knew what it did.

    Made an executive decision to power it off (I didn't have a logon for it), put a message out on the tannoy for everyone to reboot then went round the office doing ipconfig /renew on PCs not working.

    Turned out it was an old Domain Controller/Fileserver from a company they'd merged with and the departed muppet had just plugged it into the network without checking its config!
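
    The diagnosis above boils down to comparing which DHCP server each PC got its lease from. A rough Python sketch of that comparison follows; the hostnames and addresses are invented, and in practice the pairs would be collected by hand or by script from each machine's network settings.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_dhcp_server(leases: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group hostnames by the DHCP server that issued their lease."""
    servers: Dict[str, List[str]] = defaultdict(list)
    for host, server in leases:
        servers[server].append(host)
    return dict(servers)


def has_conflicting_dhcp(leases: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the leases were issued by more than one DHCP server."""
    return len(group_by_dhcp_server(leases)) > 1


# Hypothetical observations: (hostname, DHCP server that issued the lease).
observed = [
    ("pc-01", "10.0.0.10"),
    ("pc-02", "10.0.0.10"),
    ("pc-03", "192.168.1.1"),  # lease from an unexpected server
]

if __name__ == "__main__":
    print(group_by_dhcp_server(observed))
    print("conflict:", has_conflicting_dhcp(observed))
```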

    1. CorwinX Silver badge

      Re: Started work at an investment firm long ago

      Someone around here seems to have a problem with me.

      Every single thing I post gets an instant downvote, within seconds.

      The above post of mine is a completely accurate, factual and technical account of something that actually happened - with no opinions or embellishment whatsoever.

      You can't downvote reality.

      1. Anonymous Coward
        Anonymous Coward

        Re: Started work at an investment firm long ago

        "I don't know what this gear is, I'll just hit the power switch" is a good way to get yourself in a Reg story.

        You got stupid lucky. It was still incredibly stupid.

        1. CorwinX Silver badge

          Re: Started work at an investment firm long ago

          Incorrect.

          I went to every PC and checked if anyone was connected to it.

          I checked every server (I could logon to the rest) to see if they relied on it for anything.

          I *analysed* that it had a "rogue" config, was interfering with the network, and the risk of a particular unknown service going down briefly was better than half the PCs in the building not working.

          You want to offer a better way of solving the problem?

        2. CorwinX Silver badge

          Re: Started work at an investment firm long ago

          I checked that no PC or server was connected to it and that, if it was performing some unknown function, losing that would probably be better than half the PCs in the building not working.

      2. Elongated Muskrat Silver badge

        Re: Started work at an investment firm long ago

        There's some people here who downvote-on-sight. Usually it's the ones who turn any comments section into a discussion of the virtues of far-right politics. (This is not an invitation to do so here).

        1. Elongated Muskrat Silver badge

          Re: Started work at an investment firm long ago

          I see all three of them have found me today. Presumably that includes the one who always posts AC.

      3. Jellied Eel Silver badge

        Re: Started work at an investment firm long ago

        Someone around here seems to have a problem with me.

        Every single thing I post gets an instant downvote, within seconds.

        I wouldn't worry about it. Some haters just gotta hate. I've even attracted the wrath of a skiddy who's automated downvoting everything, plus created their own killfile. Which means I'm free to take the pish out of them.

        You can't downvote reality.

        Problem is some folks here wouldn't recognise reality, or the difference between facts & opinions, or subjective vs objective. But have an upvote anyway.

      4. Mike VandeVelde
        Pint

        Re: You can't downvote reality.

        I just love it when someone tells me I can't do something that I've plainly already done. It's right up there with someone USING the word "utilize", or wasting bandwidth with the "sentence" "it is what it is" and making like it has deep (or literally any) meaning or something. "Well it ain't what it ain't" is what I usually sagely reply with.

        Plus I also love people getting butthurt about likes on the internet. That's what actually triggered me right here. Reminds me of a cartoon about a humanitarian airdrop of a box full of cardboard cutouts of facebook likes to some suffering jungle tribe. "Raising awareness", that's another great one.

        P.S. I did not downvote you, not enough concern for what you said on my part over here, just "taking the piss" ;)

  35. IGnatius T Foobar !

    The real lessons learned

    How about maybe you just don't design systems that depend on a single AWS region? Better yet, don't design systems that depend on AWS at all? I've worked with several customers who will put different parts of their systems not only in different regions, but in different cloud providers entirely. Or they'll do the new trendy thing and build a hybrid of public and private cloud. Putting all your eggs in one basket is a great strategy if you're making an omelette, but not a great strategy for IT infrastructure.

    1. telveer

      Re: The real lessons learned

      Multi-region is sufficient. Multi-cloud is fine for hosting different types of workloads (fit for purpose), but deploying the same app across multiple clouds to overcome DR issues like these is a fool's errand. We do multi-region in AWS and didn't have any impact because we made a quick decision to fail over (by 5AM ET on the day of).

      We did make some of our AWS hosted apps work in a different cloud to handle a totally different risk (like a regulation or some other disagreement with AWS that forces us to exit them) but that is a migration.
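
      As a very rough illustration of that failover decision, the Python sketch below picks the first healthy region from a priority list. It is an assumption-laden toy: the health map would come from your own health checks, pick_region is a made-up name, and real failover also has to deal with data replication and DNS or routing changes.

```python
from typing import Dict, List


def pick_region(preferred: List[str], health: Dict[str, bool]) -> str:
    """Return the first region in priority order that is reported healthy."""
    for region in preferred:
        # Regions with no health report are treated as unhealthy.
        if health.get(region, False):
            return region
    raise RuntimeError("no healthy region available")


if __name__ == "__main__":
    # Hypothetical health-check results during a us-east-1 incident.
    health = {"us-east-1": False, "us-west-2": True}
    print(pick_region(["us-east-1", "us-west-2"], health))
```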

  36. doug_bostrom

    Given the limited number of vendors at this scale and the methodical portability encumbrances they've engineered, there is effectively no market force available here.

    The customers are hostages and they'll be sticking with Amazon until the dimmest bulb left running the show burns out.

  37. TeeCee Gold badge

    It's very, very simple.

    If you want to hire cheap, off-the-peg eejits to run your shit, then it must be simple, utterly standard and well-documented.

    If, on the other hand, you want to lead the market with bleedin' edge rocket science shit, then you need to pay bleedin' edge rocket scientists whatever the fuck they want to look after it.

  38. This post has been deleted by its author

  39. kurkosdr

    You know the rules

    Networking issues:

    - It can't be DNS

    - All my DNS entries are OK, there is no possible chance it's DNS

    - It was DNS

    Media playback issues:

    - It can't be Widevine

    - I only want to watch 480p, there is no possible chance it's Widevine

    - It was Widevine

  40. glennsills@gmail.com

    Where was the AI

    AI system admins would have caught this in a second.

    1. The Oncoming Scorn Silver badge
      Coat

      Re: Where was the AI

      I upvoted you, I really really hope it was sarcasm though!

  41. Moving Along

    I interviewed for a SW dev position at AWS once. One of the interviewers INSISTED that functions in C++ "could only recurse once" (he didn't even know the term maximum recursion depth to properly state that incorrect statement)

    I have no doubt about it the engineers who work at AWS are not the smartest crayons in the knife drawer.

    1. The Oncoming Scorn Silver badge
      Pint

      I have no doubt about it the engineers who currently work at AWS are not the smartest crayons in the knife drawer.

      FTFY - Have a beer!

  42. ldouglas

    The initial DNS issue?

    My question is, what was the initial DNS issue.

    I've seen no comments.

    Everyone is talking about what happened after the DNS fix on Monday, not Saturday (10/18/25).

    That's when my firewall blocked a connection with an app that has had no past connection problems.

    The firewall blocked the connection because it was now being served up on a device located in Bulgaria.

    The app runs on the AWS platform.

    Is there a connection here?

    Larry

    1. flayman Bronze badge

      Re: The initial DNS issue?

      "My question is, what was the initial DNS issue."

      Probably a BGP misconfiguration. That's usually what it is.

  43. Vader

    Why didn't AI fix this, or even identify it as a problem? Secondly, the cloud is always sold as "we will never go down, resiliency everywhere".

    Maybe companies should consider shifting from the cloud.

  44. hurricane51

    This Isn't New

    There is a long-standing joke (my father first told it to me) that goes like this:

    A school (substitute any large entity) has their heating system go out in the middle of a particularly brutal winter. It’s an old system, and the only engineer who really understands it has retired.

    The administrator reaches out to him, but the man is reluctant to fix it. The administrator says he will pay $1000 (insert sufficiently large amount to account for inflation) if he can come right away to look at it.

    The engineer arrives about an hour later. He examines the system, and reaches into his toolbox. He takes a wrench and lightly taps on a valve. Miraculously, the system springs to life, back to full function.

    When he presents his bill, it says “Repaired heating system $1000.”

    The administrator says, “You can’t possibly expect us to pay you $1000 for a simple tap on a pipe.”

    The engineer takes back the bill, writes out a new one. It reads:

    Tapping on pipe …. $1

    Knowing where to tap…. $999

    TOTAL… $1000

  45. Anonymous Coward
    Anonymous Coward

    I wonder if the "managing out" of everyone over 48 has also had an impact?

    Those of us who remember what went wrong last time this happened are now unemployable due to our age, while cheaper callow youths are preferred (less talking back and fewer requests for holidays or pay rises).

  46. Anonymous Coward
    Anonymous Coward

    Experience is no longer respected! Screw 'em!

    The reason I've recently decided to call it quits on my career in my mid 50s after 30+ years in the IT game. I've had enough of dealing with half-baked ideas, rushed projects done on the cheap by know-it-all supposed experts who will shoot you down in meetings 'cos their docs say you're wrong despite you having actually dealt with the exact same situation for real, you're just plain wrong.

    I don't have the energy to deal with this pointless shite and I know a load of other "grey beards" who've had enough and have already walked out and retired or are now just hanging around for a redundancy payout, and then it's off to retirement. I'm not saying I'm an expert but real world on the job experience has a lot to offer, but to youngsters it simply holds up the projects and no one wants some old fart getting in the way of their shiny new project by injecting calm and common sense.

    So sod 'em, I'm now seeing out my last 3 months, then you can give me my cash and I'll never bother you ever again. I'm off to do something more worthwhile with my valuable time.

  47. Nifty

    I was watching a YouTube video yesterday by 'Retired MS Engineer' that posited a different theory than mere loss of tribal knowledge as a cause. Customers (like Lloyds bank that went down with AWS) not fully knowing what they're buying with AWS services. According to that engineer, better resilience is possible but at some cost and some serious duplication. Not to mention planned outage and recovery drills.

    If a knowledge base can't store tribal knowledge, that's another potential single point of failure.

  48. Anonymous Coward
    Anonymous Coward

    Bar Lowering

    Once upon a time, Amazon believed in bar raising. In fact, part of the justification of URAs (unregrettable attrition) targets was to continue to raise the bar.

    But for about 3 years now, they've been trying to reduce headcount (post over hiring during Covid) without triggering a stock market price crash. Instead of raising the URA target from 8% to 12 or 15%, allowing management to determine who would have the least effect on their team if fired, the company made two poor decisions.

    1) they laid off a lot of people who were rated as HV1 (meeting the bar) without asking managers. The problem with that is that many teams would put new hires at HV1 while on their cash bonus to reserve stock grants for 2+yr team members, and a ton of new HV1s were top performers, just new.

    2) they purposely (but slowly) have made working there worse in order to get people to leave on their own (return to work, fewer promotions etc).

    The biggest issue with this is that it has the exact opposite effect of bar raising. The top talent that has the most marketable skills leave first. Then you're left with those who are willing to tolerate the shit because they're too afraid to get in the open job market.

    AWS used to be the top tier spot for talent, sadly most have left or have been accidentally fired.
