AWS going AWOL last week is exactly why less is more in cloud server land

Serverless is the new hotness. Like so much corporate IT, it's a complete misnomer. There are just as many servers as before, but your tasks – or microservices, if you need four more syllables – have no idea which ones they're using. Same meat, different stew. On Wednesday last week, Amazon decided to take serverless at face …

  1. amacater

    https://www.theregister.com/2020/11/30/aws_outage_explanation/

    Amazon broke their own systems because they didn't understand networking (which would have given them their digital dial tone) or their own Linux variant. They put in the wrong scaling factor and will have to rebuild accordingly. Not straightforward but symptomatic of something getting too big to model or understand :(

  2. iron Silver badge

    Did the author read the previous article and Amazon blog post about this incident?

    > Amazon is famously reluctant to disclose what goes on in incidents like this

    That line makes me think not.

  3. Anonymous Coward

    AWS us-east-1 and reliability

    We've got a couple of hundred servers running in different AWS regions and it's in us-east-1 where weird things happen. The AWS folk we've been in contact with during incidents haven't said it explicitly, but reading between the lines they seem to be facing the same issues I saw during the years when I was dealing with on-premises servers. I.e. bits have been stuck together with various scripts, probably in a hurry at times, and it is not 100% clear how it works at least some of the time.

    My guess is that as North Virginia (us-east-1) is the original AWS region, it has more such hacks running than in other regions so it is more inclined to be unstable.

    Personally I'd avoid hosting anything production-related in that region; there are plenty of other AWS regions available these days where we at least haven't had any issues.

    1. Strahd Ivarius Silver badge
      Devil

      Re: AWS us-east-1 and reliability

      I am pretty sure that the outage had nothing to do with certain three-letter organisations also located in North Virginia...

    2. K Cartlidge

      Re: AWS us-east-1 and reliability

      RE: My guess is that as North Virginia (us-east-1) is the original AWS region, it has more such hacks running than in other regions so it is more inclined to be unstable.

      And us-east-1 is also treated as a bit of a special snowflake. There are things existing in that region even when running in EU regions. It's also the default region for EC2 endpoints that don't have one.

    3. Mr.Nobody

      Re: AWS us-east-1 and reliability

      There has been an unwritten rule for a long time to never run anything in AWS us-east-1 if one wants it to work without issues. I have heard this for well over five years now.

      But the underlying issue here with the Kinesis service is like many of the other outages that occur at AWS. No one else in the world has systems like this, because they are proprietary to AWS.

      Even if someone did have a similar system, with known answers on scalability, possible failures and how to fix them, no other company is operating them at the scale AWS does. Things will break, and they will continue to cause outages like this one for years to come. There is just so much they don't know about their products and services, because they haven't had a failure with them yet.

      It brings me right back to the EBS failures they had a few years ago.

      1) EBS volumes for a whole region went TITSUP.

      2) Smart AWS customers had their EC2 systems set to reboot in another region when they went down just like AWS taught them to.

      3) EBS storage systems in the other regions struggled mightily to boot up all these newly starting systems, and those regions suffered tremendous performance problems that essentially blew up everyone's day.

      That these people had not planned for a whole region failing and its workloads booting up in all of the other regions was enough for me to understand that these sorts of outages will just continue to happen at AWS and the other cloud vendors.
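The failover stampede in steps 1) to 3) above is a thundering-herd problem, and one common client-side mitigation is to retry with exponential backoff plus random jitter so that restarts do not all land at once. The following is a minimal sketch of that general technique, not a description of what AWS or its customers actually did; the function name `backoff_delays` and the `base` and `cap` values are hypothetical.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait (in seconds) per retry attempt.

    Hypothetical helper illustrating "full jitter" backoff; base and cap
    are illustrative values, not recommendations from any vendor.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        # Exponential ceiling: base, 2*base, 4*base, ... limited by cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Pick uniformly in [0, ceiling] so clients do not retry in lockstep.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(5, rng=random.Random(42))):
        print(f"retry {i}: wait {delay:.2f}s")
```

Spreading restarts out this way only softens the load spike; it does not create spare capacity in the region that has to absorb the failover.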

  4. Steve Button Silver badge

    "There are just as many servers as before"

    Wrong. Most organizations, even large ones, can't pack them to the gills in the same way a super massive one like Amazon, Azure or GCP can. So, there are actually fewer servers. But still not server-less, you just don't have to care that there's a server at the end of it.

  5. Mage Silver badge
    Happy

    Thank You

    It's good to see a Cloud article that's not PR spin.

    There are applications where it's the best fit.

    There are others that need solutions that are your own and where you know exactly what and where the computers are.

    If you are really big, you should have your own data centres in different countries, basically your private cloud.

  6. Packet

    What bugs me is...

    1. How much consumer companies rely on 'the cloud' for their products' functionality when shit like this breaks

    2. How little their tech people know about what has happened, resulting in product replacement offers (all at cost to the company) which then don't fix the issue

    1. avakum.zahov

      Re: What bugs me is...

      Ah, but the pointy haired bosses do not care about how much it costs the company. The only thing that matters to them is that they can pass the blame. The Cloud is somebody else's computer, right!? So, it is not our fault, we chose the best. It is Amazon's (Azure's, etc.) ...

      Nothing new under the sun.

      1. jmch Silver badge

        Re: What bugs me is...

        "but the pointy haired bosses do not care about how much it costs the company"

        If cloud wasn't perceived to be cheaper than running their own servers, the beancounters would overrule the pointy haired bosses

        1. NetBlackOps

          Re: What bugs me is...

          Cloud is far more expensive over the long term. However, CAPEX vs. OPEX is still the greatest observable absurdity of pointy-haired bossdom. Cloud leaves you so vulnerable in so many ways, yet they don't care because bonuses all around.

    2. Flywheel
      Unhappy

      Re: What bugs me is...

      Not just consumer companies though. It looks like an increasingly large number of gov.uk sites have content hosted on AWS - Beta Companies House is one.

      1. ElectricPics

        Re: What bugs me is...

        They have little choice. UK government departments must use the Government Digital Services Marketplace, and public sector IT must adhere to the Cloud First policy.

  7. Anonymous Coward

    Tradeoffs the punters can't control (and don't have the relevant decision making info anyway)

    "The point isn't that cloud providers haven't achieved very high degrees of reliability. They have. It's that you as their client don't have the tools to easily decide how much trade-off is good for you, or how much risk you're happy to cede to a company in short-term resilience or long-term lock-in."

    Sums up the cloud in general, shirley?

    1. MattPi

      Re: Tradeoffs the punters can't control (and don't have the relevant decision making info anyway)

      > Sums up the cloud in general, shirley?

      The Cloud (as a whole) gives you the tools to be as resilient as you want to be, but it sounds like a number of companies (including Amazon Music, which is where I saw issues) didn't architect things well. The loss of US-EAST-1 or any single AZ shouldn't break serious apps. If you're extra serious, you run in AWS and something else (Azure, GCP, etc.) and even your own at something like Switch. Something breaking all that stuff at the same time means there's unlikely to be a working internet for clients to notice your outage.

      It's all about how much you want to spend on good IT people.

      1. Mr.Nobody

        Re: Tradeoffs the punters can't control (and don't have the relevant decision making info anyway)

        While what you say is true, it eliminates many of the cost savings cloud does have to offer, namely using services instead of a server with an app on it (or many servers with many apps on them).

        Being cloud vendor agnostic is extremely expensive. If you have complicated products, you now need to have experts in two or three cloud providers, and you have to have all the infrastructure pieces for them to work together.
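The trade-off argued in this sub-thread (more regions or providers buys resilience, at a cost) can be put in rough numbers. This is a back-of-envelope sketch that assumes each deployment fails independently, which is a strong assumption given the comments above about other regions depending on us-east-1. The function name `combined_availability` and the 99.9% figure are illustrative, not published AWS numbers.

```python
def combined_availability(availabilities):
    """Availability of a service that is up if ANY one deployment is up.

    Assumes failures are statistically independent (an assumption, and
    often an optimistic one when regions share control-plane services).
    """
    p_all_down = 1.0
    for a in availabilities:
        # Probability that this deployment is down at a given moment.
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


if __name__ == "__main__":
    one = combined_availability([0.999])
    two = combined_availability([0.999, 0.999])
    print(f"one deployment:  {one:.6f}")
    print(f"two deployments: {two:.6f}")
```

Under that independence assumption, a second 99.9% deployment takes the combined figure to about 99.9999%; any shared dependency between the two pulls the real number back down.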

    2. yoganmahew

      Re: Tradeoffs the punters can't control (and don't have the relevant decision making info anyway)

      On average, the cloud providers have very good reliability...

  8. Claptrap314 Silver badge
    Facepalm

    You're talking out the wrong end again

    Four weeks ago, you sprinkled magic faerie dust over elections & proclaimed "it's time" for one of the worst ideas of all time. Now, you're saying...what exactly?

    The reliability miracle that the pundits were proclaiming over the cloud was of course hype. What no one knew a decade ago was that AWS did not even have the right people in the room to deliver five nines. And believe me, neither does the average shop that's handling less than 1kqps. That's almost everyone. Moreover, mixed environments create entirely new fail surfaces.

    As for some meta-service, that's really just escalating hype. The cloud providers are HIGHLY motivated to make multi-cloud unnecessarily difficult. The only way I can see that happening is if their customers demand it. Think IBM demanding that Intel license the 286 to AMD. But Amazon, Google, and Microsoft are literally three of the four biggest market cap companies in the world. NO ONE is going to be able to demand that these stabilize and synchronize their offerings sufficiently that any one riding on top won't be one API change away from 0 availability.

    Certainly, companies would like to have what the cloud hype proclaims. But that's going to require deeper changes than this article is even starting to address.

    1. Mike 16

      Five Nines

      Long ago (well, about 15 years) and far away (Milpitas) I was "acquihired" (aka "Borged") by a major networking company.

      "Onboarding" involved a pep talk by a high muckety-muck of the sales group, who pledged that we would deliver "nine fives" by the end of the year.

      I glanced at a fellow bit of plankton and it was clear from their expression that they agreed: "Yep, that's about what this lot can hit"

      1. Claptrap314 Silver badge
        Trollface

        Re: Five Nines

        Well, if they are monitoring their availability to the millisecond, that's a good start, at least!
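For readers who want the arithmetic behind the "five nines" versus "nine fives" joke, here is a small sketch converting an availability figure into allowed downtime per year. It assumes a 365-day year, and it reads "nine fives" as 55.5555555% availability, which is one interpretation of the quip rather than anything the sales pitch defined.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def n_nines(n):
    """Availability written as n nines, e.g. n=5 gives 0.99999."""
    return 1.0 - 10.0 ** (-n)


def downtime_minutes_per_year(availability):
    """Minutes per year a service may be down at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    five_nines = downtime_minutes_per_year(n_nines(5))
    # "Nine fives" interpreted as 55.5555555% (an assumption).
    nine_fives = downtime_minutes_per_year(0.555555555)
    print(f"five nines: {five_nines:.3f} minutes of downtime per year")
    print(f"nine fives: {nine_fives / (24 * 60):.1f} days of downtime per year")
```

Five nines works out to a little over five minutes of downtime a year; "nine fives" allows roughly 162 days.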

  9. Dinanziame Silver badge
    Windows

    "Too bad they won't live – but then again, who does?"

    I've seen what you did there... Seen things you people wouldn't believe... like tears in rain.

  10. Anonymous South African Coward Bronze badge

    We migrated our emails to the cloudy office hosted by Microsoft, because of persistent DDoS attacks on our hosted exchange.

    Time will tell if we made the right choice or not.

    I don't trust the cloud, as it is whimsical, like Mr Murphy, and will leave you to hang should you have a critical meeting/email/whatever....
