AWS admits to 'severely impaired' services in US-EAST-1, can't even post updates to Service Health Dashboard

AWS has delivered the one message that will strike fear into the hearts of techies working out their day before the Thanksgiving holidays: the US-EAST-1 region is suffering a "severely impaired" service. At fault is the Kinesis Data Streams API in that, er, minor part of the AWS empire. The failure is also impacting a number of other …

  1. aregross


    Point of failure


    It'll get'cha every time!

    1. Anonymous Coward


      A single point of failure translates into multiple dollars of profit for Bezos so don't expect it to change anytime soon. It's all about the bottom line, not the customer.

      1. Peter-Waterman1

        Re: But

        What single point of failure is that then? Don’t see that mentioned anywhere.

        1. Yes Me Silver badge

          Re: But

          Ironically enough, the largest distributed processing system in the Eastern USA appears to be a single point of failure. One of the benefits of virtualisation!

          1. Peter-Waterman1

            Re: But

            Root cause:

            At 9:39 AM PST, we were able to confirm a root cause, and it turned out this wasn’t driven by memory pressure. Rather, the new capacity had caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration. As this limit was being exceeded, cache construction was failing to complete and front-end servers were ending up with useless shard-maps that left them unable to route requests to back-end clusters.
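The mechanism in that post-mortem rewards a closer look: because each front-end server's thread count scaled with the size of the fleet, adding capacity pushed every server past the OS limit at the same moment. A minimal sketch of that failure mode (hypothetical numbers and names, not AWS's actual code):

```python
# Hypothetical model of the failure mode in the post-mortem above: each
# front-end server keeps one OS thread per peer in the fleet, so thread
# usage grows with fleet size on every server at once.

def projected_threads(fleet_size: int, threads_per_peer: int = 1) -> int:
    """Threads each front-end server needs for a fleet of this size."""
    return fleet_size * threads_per_peer

def fleet_exceeds_limit(fleet_size: int, os_thread_limit: int) -> bool:
    """True when added capacity pushes every server past the OS cap
    (on Linux, caps like this come from ulimit -u or
    /proc/sys/kernel/threads-max)."""
    return projected_threads(fleet_size) > os_thread_limit

# With a made-up cap of 10,000 threads, growing the fleet past 10,000
# peers breaks every server simultaneously -- a fleet-wide failure,
# not a single bad host.
assert not fleet_exceeds_limit(9_999, 10_000)
assert fleet_exceeds_limit(10_001, 10_000)
```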


  2. Nate Amsden

    what a great day

    I guess that's all I had to say. Moved the org I work for out of their cloud about 9 years ago now, saving roughly $1M/year in the process. Some internal folks over the years have tried to push to go back to a public cloud because it's so trendy, but they could never make the cost numbers come close to making it worthwhile, so nothing has happened.

    1. Anonymous Coward
      Anonymous Coward

      Re: what a great day

      What a load of tosh. So your company was all in on public cloud 9 years ago were they? Must have been a real trailblazer to have been spending enough to ‘save’ that much money by moving off? And what exactly were they using that they managed to save $1M/year by rebuilding the whole lot on-prem? And I guess you’ve not had to refresh/repair/service any of that hardware 2-3 times in the last 9 years?

      Regardless of what you think about cloud, if your bean counters can’t make the numbers stack up then you need some new bean counters. There are reasons not to go to cloud, but your post is complete fiction.

      1. Nate Amsden

        Re: what a great day

        Actually just retired some of our earliest hardware about 1 year ago. A bunch of DL385 G7s, an old 3PAR F200, and some Qlogic fibre channel switches. I have Extreme 1 and 10 gig switches that are still running from their first start date of Dec 2011 (they don't go EOL until 2022). HP was willing to continue supporting the G7s for another year as well; I just didn't have a need to keep them around anymore. The F200 went end of life maybe 2016 (it's been on 3rd party support since).

        Retired a pair of Citrix Netscalers maybe 3 years ago now that were EOL; current Netscalers go EOL in 2024 (bought in 2015), and I don't see a need to do anything with them until that time. Also retired some VPN and firewall appliances over the past 2-3 years as they went EOL.

        I expect to need major hardware refreshes starting in 2022 and finishing in 2024; most gear that will get refreshed will have been running for at least 8 years at that point. Have no pain points for performance or capacity anywhere. The slowdown of "Moore's Law" has dramatically extended the useful life of most equipment, as the advances have been far less impressive these past 5-8 years than they were the previous decade.

        I don't even need one full hand to count the number of unexpected server failures in the past 4 years. Just runs so well, it's great.

        As a reference point, we run around 850 VMs of various sizes. Probably 300-400 containers now too, many of which are on bare metal hardware. Don't need hypervisor overhead for bulk container deployment.

        The cost savings are nothing new; I've been talking about this myself for about 11 years now, since I was first exposed to the possibility of public cloud. The last company I was at was spending upwards of $300k/mo on public cloud. I could have built something that could handle their workloads for under $1M. But they weren't interested, so I moved on, and they eventually went out of business.

        1. Martin an gof Silver badge

          Re: what a great day

          Your tone is a little supercilious, sorry, but

          most gear that will get refreshed will have been running for at least 8 years at that point

          I'm not at all in the end of the business that needs massive data centres, so I'm not really qualified to comment, other than saying that "everything in (someone else's) cloud" has always seemed wrong to me, but I think the above is a key point.

          From the dawn of "commodity" hardware (i.e. since running your business on a minicomputer stopped being the done thing) until about ten or twelve years ago, a 24 to 36 month refresh cycle for most kinds of commodity IT kit would bring huge step-improvements each time for just about any common workload. Faster processing, faster storage, faster networking, lower power consumption, smaller physical size, lower purchase cost and all this at a time when the amount of "data" handled by typical businesses was growing exponentially.

          At that point, one big selling point of renting other people's infrastructure was that you could benefit from this refresh cycle without doing anything more than paying the bills. The downsides - primarily loss of control - didn't seem so bad.

          These days? I don't know for sure, but I get the feeling that the amount of data handled by typical businesses is still growing, but more slowly, while the service life of commodity hardware components has increased and speeds - of processing, networking, storage - remain "good enough" for much longer. Crumbs, it's twelve years (I think) since I bought my first gigabit switch at home and I am not desperately saving up for 10Gbit kit (though I am considering adding a second network connection to the NAS). At my place of work, gigabit has only become available at the desktop within the last 18 months, interconnects are only now being upgraded to 10Gbit (requires fibre replacement), and most desktops are actually still connected at 100Mbit and boot from SATA-connected spinning rust. It's "good enough" for Outlook, Teams, Word, Excel, Powerpoint, web, some minor photo editing etc.

          Thus one of the big selling points of cloud has disappeared, and the main remaining attraction is that when something goes wrong it is "someone else's problem". It's a different equation.

          At this point it's usually worth considering the comparison between cloud services and organisations which outsource building maintenance and cleaning, or even making a comparison with the Private Finance Initiative for public projects. In many cases it has been proven that the projected savings are not realised in practice, and the inconvenience is often a major factor - consider a large school which no longer has a "caretaker" on premises during open hours, for example. Some child breaks a tap in the chemistry lab and there's water spraying everywhere? Fred the caretaker would be there in five minutes with a spanner, maybe even a spare tap. Mitie on a four-hour call-out probably arrive after the end of the school day and have sent an electrician to do a plumbing job because at least they meet their four-hour commitment.


          1. Doctor Syntax Silver badge

            Re: what a great day

            When something goes wrong it's still your problem, you've just put the ability to solve it into somebody else's hands.

        2. Robert Grant

          Re: what a great day

          > The last company I was at was spending upwards of $300k/mo on public cloud. I could have built something that could handle their workloads for under $1M.

          Does that include staffing your solution? And was their cloud solution cost-optimised, or could you have got that cost down as well?

      2. Anonymous Coward
        Anonymous Coward

        Re: what a great day

        Replace your bean counters until the numbers stack up...... What a telling suggestion.

        There are reasons people want to move to public cloud, some rational, some irrational (I've heard a few about strategy, but they are mainly about wanting opex or being "on trend"), but if you do it to save money, you'll likely be disappointed, or have to spend a lot of money to refactor all your apps in the first place.

        I get that startups love public cloud - they don't have the finances to build infrastructures to deliver what they hope they might need, but if they get large enough, and need to save money, they'll likely do a "Dropbox".

        I like the idea of hybrid cloud (without getting yourself locked in to any particular public cloud) - otherwise you are basically opening your chequebook whilst handcuffing yourself to your chosen cloud vendor. Who in their right mind would do that (other than someone who didn't know what they were doing)?

        Public cloud is seldom cheaper (which I agree doesn't stop people moving there), but I do wish more people could at least count and understand what they were signing up for. And if they don't, perhaps they should stop suggesting people fire their bean counters for being able to count/being honest (directly aimed at the AC comment).

        1. brianpope

          Re: what a great day - beancounters

          Often beancounters' hands are tied to company quarterly results and other rules that mean OPEX is easier to sanction and handle than CAPEX, which can take time to realize its potential.

          I've seen opportunities for cost savings and better technical environments to be overlooked because funds are in "the wrong budget".

      3. Anonymous Coward
        Anonymous Coward

        Re: what a great day

        Or just maybe he has a bean counter that can actually count beans properly?

      4. a_builder

        Re: what a great day

        I did read it and wonder too.

        However, our experience is that you do get better real performance and save costs with on premises.

        We too were pretty much fully cloudy in 2012 and reversed a lot of that.

        I agree about the slowdown in CPU speed/power saving. So the rush to upgrade is smaller.

        For most larger SMEs, a mirror DC can be a closet-sized rack with 1U servers. With the growing ubiquity of 1G and 10G FTTP, the mirror site does not need to cost a fortune. There is generally a lot of space in existing server rooms as things have shrunk. I just asked around a few other CEOs I knew and we agreed a mutual zero-cost deal.

        With a bit of imagination it is all doable.

        The trick is to keep the upgrade cycle rolling slowly - no Big Bang must-spend-a-gazillion-CAPEX-this-year stuff. That is what sets off Bean Counter Central..... I'm lucky I'm the majority shareholder....

  3. Forget It

    AWS gone off a cliff ... since Tim Bray left!

    Bork Bork Bork

    and it's not even the weekend!

  4. Claptrap314 Silver badge

    I learned SRE at Google

    We would not even look at a service unless it was in three separate DCs.

    I know that Amazon's regions & AZs don't map directly to Google's DCs, but if you haven't learned by now, US-East alone cannot provide five nines. You MUST go multi-region at least for that. Last I heard, AWS charges the big bucks for inter-regional data flows. Enough that it is probably worthwhile to look into the competition.

    BTW, I'm available if you need detailed instruction/help to make it happen... ;)
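The multi-region arithmetic behind that advice is straightforward, assuming regions fail independently (a strong assumption in practice). A back-of-the-envelope sketch:

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Availability of a service spread across independently failing
    regions, where it survives as long as any one region is up."""
    return 1 - (1 - per_region) ** regions

# One region at three nines is ~0.999; two independent regions are
# already ~0.999999 on paper (1 - 0.001**2) -- which is why multi-region
# is the only route to five nines, and why those inter-regional data
# transfer charges matter.
```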

    1. Anonymous Coward
      Anonymous Coward

      Re: I learned SRE at Google

      So that means you learned to discontinue every service you provide, and that your “customers” really are the “service”?

      1. Tomato42

        Re: I learned SRE at Google

        no, he figured out that if there's no service, there's nothing to break, so you get 100% uptime!

        1. yoganmahew

          Re: I learned SRE at Google

          And if anything under 5 minutes is not an interruption, then trebles all round!

      2. Claptrap314 Silver badge

        Re: I learned SRE at Google

        :D Have one--->

        My job was to enable management to deliver the nines. What they chose to do with that power was their business.

    2. mmccul

      Re: I learned SRE at Google

      Having seen what is required to reliably provide a legally mandated five-nines service, I believe that no cloud I've looked at, even multi-region or multi-cloud, comes close. I've said for years that AWS delivers about 99.7% uptime, year over year. A service that needs five nines, meaning only about 5 minutes of downtime (planned or not) per year? That takes a very different mindset about processes like change management, redundancy, etc.

      Smaller services are more likely to deliver higher reported uptimes because there are fewer moving parts that can cause widespread failures, but *any* issue impacting the overall service impacts your availability numbers. I don't give credit for "planned" service downtime.

      The rule of thumb I still teach is every nine after the first one adds a zero to your cost estimate.

      Some day, I'd love a discussion/briefing on what it takes to deliver the six-nines service level.
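The nines-to-downtime conversion, and that "add a zero per nine" rule of thumb, reduce to a couple of lines (the cost rule is folklore rather than a law, so treat the multiplier as illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(nines: int) -> float:
    """Minutes of downtime per year permitted by an N-nines SLA."""
    return MINUTES_PER_YEAR * 10 ** -nines

def cost_multiplier(nines: int) -> int:
    """Rule of thumb above: every nine after the first adds a zero."""
    return 10 ** (nines - 1)

# Five nines allows about 5.26 minutes/year -- the "only 5 minutes"
# figure above -- and by the rule of thumb costs roughly 10,000x a
# one-nine baseline.
```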

      1. Claptrap314 Silver badge

        Re: I learned SRE at Google

        On the inside, Google's distributed configuration data storage service actually delivered so much reliability that they would deliberately take it down for one minute at the end of each quarter. This was because the system was NOT specced at 100% uptime, but they were concerned that if it never went down, people would come to expect that.

        Those deliberate outages were the only ones that it had for the sixteen months I was there.

        I've not compared GCP to Google's internal systems, so I can't say how what they are promising matches what is/was going on from the inside. In the four years since, the reported outages suggest that they are still having some difficulties driving the software engineering principles required for success.

        I've heard that "10x a nine" rubric before. Not having personally driven a service from two nines to five, I cannot say with certainty, but there is clearly something like that involved. With two nines and a tight ship, you can get by with planned outages. To actually deliver three nines, you need no planned outages, 24/7 oncall, and DCs that are isolated geographically and electrically. To deliver four nines, you require automated rollbacks, switchovers, and scaling, plus oncall people awake at all times, probably with "cover me--I'm going to use the facilities". Yeah, five nines requires all of that fully matured, plus full redundancy in your inter-DC communications.

        And since SRE hasn't been around for a decade yet, six nines is a theoretical thing.

  5. disk iops

    Pricing can fix this pretty damn quick. Make US-EAST-1 double the price of US-WEST-1 or US-EAST-2 (Ohio). Problem is, the infra in Seattle and Ohio isn't being built big enough to move 40% of US-EAST-1 into it.

  6. sanmigueelbeer

    Bring your stuff to the cloud, they say. Guaranteed reliability.

  7. Anonymous Coward
    Anonymous Coward

    "is all about dealing with real-time data, such as telemetry from IoT devices"

    How much useless traffic are we pumping through networks and DCs just to make marketers' wet dreams come true? One day we will drown in such a data tsunami....

  8. Pascal Monett Silver badge

    "severely impaired"

    Yay !

    Once again the thing that high-flyers got insane bonuses out of to move the company to The Cloud (TM) has fallen flat on its face and millions are impacted.

    Well done !

    One day, you'll learn.

    One day.

  9. Kevin McMurtrie Silver badge

    Microservice hell

    AWS is going where Google is. The internal microservices are so layered and tangled that not only is everything a single point of failure, but the failure is so complete that the origin can't be found. Restoration is a parallel brute-force effort.

  10. Anonymous Coward
    Anonymous Coward

    what actually broke? I haven't seen a single issue on the internet anywhere.

  11. Doctor Syntax Silver badge

    "elemetry from IoT devices"

    Every cloud has a silver lining.

  12. onefastskater

    A message from iRobot ... Roomba

    Contact iRobot Customer Care

    An Amazon AWS outage is currently impacting our iRobot Home App. Please know that our team is aware and monitoring the situation and hope to get the App back online soon. Thank you for your understanding and patience.

    Was about to factory reset my Roomba when I came upon this note. Had to run the vacuum cleaner the old-fashioned way!

    1. Strahd Ivarius Silver badge

      Re: A message from iRobot ... Roomba

      The Rise of the Machines is foiled, again!!!

  13. Diogenes8080

    History repeats itself

    "Werner said we don't need a DR site for US-East-1 because we have multiple availability zones."

    "If you work for VMware, Cisco or Apache leave the room."

    I wonder if a "defective storage bot" was to blame this time?

    1. Claptrap314 Silver badge

      Re: History repeats itself

      Multi-AZ outages, especially in US-East-1, have been a thing since the introduction of AZs.

      Amazon's systems were not originally architected for HA. That's almost as hard to retrofit as security.

  14. Anonymous Coward
    Anonymous Coward

    The Cloud

    Other people's computers that you have no control over.
