AWS going AWOL last week is exactly why less is more in cloud server land • The Register Forums

Monday 30th November 2020 10:24 GMT amacater

https://www.theregister.com/2020/11/30/aws_outage_explanation/

Amazon broke their own systems because they didn't understand networking (which would have given them their digital dial tone) or their own Linux variant. They put in the wrong scaling factor and will have to rebuild accordingly. Not straightforward, but symptomatic of something getting too big to model or understand :(
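The "wrong scaling factor" is easier to see with numbers. AWS's published post-event summary said each Kinesis front-end server kept an operating-system thread for every other server in the front-end fleet, so adding capacity raised the thread count on every existing machine until it hit the OS limit. Below is a minimal sketch, assuming one thread per peer plus a fixed base. The base thread count and the limit are hypothetical values, not AWS's real figures.

```python
# Sketch of thread-per-peer growth. Every number here is hypothetical,
# not AWS's real configuration.
BASE_THREADS = 200       # threads a server needs regardless of fleet size (assumed)
OS_THREAD_LIMIT = 1000   # per-server OS thread ceiling (assumed)


def threads_per_server(fleet_size: int) -> int:
    """Threads on one server that keeps one thread per *other* server in the fleet."""
    return BASE_THREADS + (fleet_size - 1)


def total_threads(fleet_size: int) -> int:
    """Fleet-wide thread count, which grows roughly with the square of fleet size."""
    return fleet_size * threads_per_server(fleet_size)


def max_safe_fleet(limit: int = OS_THREAD_LIMIT) -> int:
    """Largest fleet in which every server stays at or under the per-server limit."""
    # Solve BASE_THREADS + (n - 1) <= limit for n.
    return limit - BASE_THREADS + 1


if __name__ == "__main__":
    for n in (500, 800, 801, 802):
        t = threads_per_server(n)
        status = "ok" if t <= OS_THREAD_LIMIT else "OVER LIMIT"
        print(f"fleet={n:4d} threads/server={t:4d} total={total_threads(n):7d} {status}")
```

Going from 801 to 802 servers adds one thread to every machine. In this toy model, every server crosses the limit at the same moment. That is why a small capacity addition can take down the whole fleet rather than one box.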

Monday 30th November 2020 10:59 GMT iron

Did the author read the previous article and the Amazon blog post about this incident?

> Amazon is famously reluctant to disclose what goes on in incidents like this

That line makes me think not.

Monday 30th November 2020 11:18 GMT Anonymous Coward

AWS us-east-1 and reliability

We've got a couple of hundred servers running in different AWS regions, and it's in us-east-1 where the weird things happen. The AWS folk we've been in contact with during incidents haven't said so explicitly, but reading between the lines they seem to be facing the same issues I saw during my years dealing with on-premises servers, i.e. bits stuck together with various scripts, probably in a hurry at times, so that it is not 100% clear how it all works, at least some of the time.

My guess is that as North Virginia (us-east-1) is the original AWS region, it has more such hacks running than in other regions so it is more inclined to be unstable.

Personally I'd avoid hosting anything production-related in that region; there are plenty of other AWS regions available these days where we at least haven't had any issues.

Monday 30th November 2020 11:50 GMT Strahd Ivarius

Re: AWS us-east-1 and reliability

I am pretty sure that the outage had nothing to do with some three-letter organisations also located in North Virginia...

Monday 30th November 2020 18:35 GMT K Cartlidge

Re: AWS us-east-1 and reliability

RE: My guess is that as North Virginia (us-east-1) is the original AWS region, it has more such hacks running than in other regions so it is more inclined to be unstable.

And us-east-1 is also treated as a bit of a special snowflake. Some things live in that region even when you're running in EU regions. It's also the default region for EC2 endpoints that don't specify one.

Monday 30th November 2020 18:46 GMT Mr.Nobody

Re: AWS us-east-1 and reliability

There has been an unwritten rule for a long time to never run anything in AWS us-east-1 if one wants it to work without issues. I have heard this for well over five years now.

But the underlying issue here with the Kinesis service is like many of the other outages that occur at AWS. No one else in the world has systems like this, because they are proprietary to AWS.

Even if someone did have a similar system, with its scalability limits, possible failures and fixes all worked out, no other company is operating one at the scale AWS does. Things will break, and they will continue to cause outages like this one for years to come. There is just so much they don't know about their products and services, because they haven't had a failure with them yet.

It brings me right back to the EBS failures they had a few years ago.

1) EBS volumes for a whole region went TITSUP.

2) Smart AWS customers had their EC2 systems set to reboot in another region when they went down, just like AWS taught them to.

3) EBS storage systems in the other regions struggled mightily to boot up all these newly starting systems, and those regions suffered tremendous performance problems that essentially blew up everyone's day.

That nobody had planned for a whole region failing and its systems booting up in all of the other regions was enough to convince me that these sorts of outages will just keep happening at AWS and the other cloud vendors.
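The failure described above is at heart a capacity question. Every surviving region needs enough headroom to absorb its share of a failed region's work. Below is a minimal sketch, assuming the failed region's load spreads evenly over the survivors. The region names, loads and capacities are hypothetical.

```python
from typing import Dict, List


def failover_load(load: Dict[str, float], failed: str) -> Dict[str, float]:
    """Spread the failed region's load evenly across all surviving regions."""
    survivors = [region for region in load if region != failed]
    if not survivors:
        raise ValueError("no surviving regions to fail over to")
    extra = load[failed] / len(survivors)
    return {region: load[region] + extra for region in survivors}


def overloaded(load: Dict[str, float], capacity: Dict[str, float], failed: str) -> List[str]:
    """Surviving regions whose post-failover load exceeds their capacity."""
    after = failover_load(load, failed)
    return sorted(region for region, demand in after.items() if demand > capacity[region])


if __name__ == "__main__":
    # Hypothetical regions, each running at 60 units of load.
    load = {"region-a": 60.0, "region-b": 60.0, "region-c": 60.0}
    print(overloaded(load, {r: 80.0 for r in load}, failed="region-a"))   # ['region-b', 'region-c']
    print(overloaded(load, {r: 100.0 for r in load}, failed="region-a"))  # []
```

Suppose each region runs at 60 of 80 units. Losing one region pushes both survivors to 90 and overloads them, while the same loads fit with 100 units of capacity. Real failover is rarely this evenly spread, which can make the headroom problem worse.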

Monday 30th November 2020 11:18 GMT Steve Button

"There are just as many servers as before"

Wrong. Most organizations, even large ones, can't pack servers to the gills the way a super-massive provider like Amazon, Azure or GCP can. So there are actually fewer servers. They're still not serverless, though; you just don't have to care that there's a server at the end of it.
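The density argument can be illustrated with a toy bin-packing model. Pooling many organisations' workloads on one fleet lets small jobs fill the gaps left by large ones. Below is a minimal sketch using the first-fit-decreasing heuristic. This is a standard textbook method, not a claim about how any cloud provider actually places workloads. The server capacity and workload sizes are hypothetical.

```python
from typing import List

SERVER_CAPACITY = 10  # hypothetical capacity units per server


def first_fit_decreasing(loads: List[int], capacity: int = SERVER_CAPACITY) -> int:
    """Return the number of servers first-fit-decreasing needs for these loads."""
    free: List[int] = []  # remaining capacity of each opened server
    for load in sorted(loads, reverse=True):
        if load > capacity:
            raise ValueError(f"load {load} exceeds server capacity {capacity}")
        for i, room in enumerate(free):
            if load <= room:
                free[i] -= load
                break
        else:
            free.append(capacity - load)  # no open server has room: open a new one
    return len(free)


def servers_separately(tenants: List[List[int]]) -> int:
    """Each organisation packs only its own workloads onto its own servers."""
    return sum(first_fit_decreasing(t) for t in tenants)


def servers_pooled(tenants: List[List[int]]) -> int:
    """A large provider packs everyone's workloads onto one shared fleet."""
    return first_fit_decreasing([load for t in tenants for load in t])


if __name__ == "__main__":
    tenants = [[6], [4], [6], [4]]  # hypothetical workload sizes for four organisations
    print(servers_separately(tenants), servers_pooled(tenants))  # 4 2
```

Four organisations that each need one server for a single job need four servers between them, while the pooled jobs fit on two. Real placement also has to respect isolation, memory, network and failure-domain constraints, so the savings are rarely this clean.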

Monday 30th November 2020 12:54 GMT Mage

Thank You

It's good to see a Cloud article that's not PR spin.

There are applications where it's the best fit.

There are others that need solutions that are your own and where you know exactly what and where the computers are.

If you are really big, you should have your own data centres in different countries, basically your private cloud.

Monday 30th November 2020 13:18 GMT Packet

What bugs me is...

1. How much consumer companies rely on 'the cloud' for their products' functionality when shit like this breaks

2. How little their tech people know about what has happened, resulting in product replacement offers (all at cost to the company) that then don't fix the issue

Monday 30th November 2020 13:35 GMT avakum.zahov

Re: What bugs me is...

Ah, but the pointy-haired bosses do not care about how much it costs the company. The only thing that matters to them is that they can pass the blame. The Cloud is somebody else's computer, right!? So it is not our fault, we chose the best. It is Amazon's (Azure's, etc.) ...

Nothing new under the sun.

1. Tuesday 1st December 2020 13:56 GMT jmch
  
  Re: What bugs me is...
  
  "but the pointy haired bosses do not care about how much it costs the company"
  
  If cloud weren't perceived to be cheaper than running their own servers, the beancounters would overrule the pointy-haired bosses.
  
  1. Wednesday 2nd December 2020 10:54 GMT NetBlackOps
    
    Re: What bugs me is...
    
    Cloud is far more expensive over the long term. However, CAPEX vs. OPEX is still the greatest observable absurdity of pointy-haired bossdom. Cloud leaves you so vulnerable in so many ways, yet they don't care because bonuses all around.
    
Monday 30th November 2020 14:26 GMT Flywheel

Re: What bugs me is...

Not just consumer companies though. It looks like an increasingly large number of gov.uk sites have content hosted on AWS - Beta Companies House is one.

1. Monday 30th November 2020 20:25 GMT ElectricPics
  
  Re: What bugs me is...
  
  They have little choice. UK government departments must use the Government Digital Services Marketplace, and public sector IT must adhere to the Cloud First policy.
  
Monday 30th November 2020 14:10 GMT Anonymous Coward

Tradeoffs the punters can't control (and don't have the relevant decision making info anyway)

"The point isn't that cloud providers haven't achieved very high degrees of reliability. They have. It's that you as their client don't have the tools to easily decide how much trade-off is good for you, or how much risk you're happy to cede to a company in short-term resilience or long-term lock-in."

Sums up the cloud in general, shirley?

Monday 30th November 2020 15:26 GMT MattPi

Re: Tradeoffs the punters can't control (and don't have the relevant decision making info anyway)

"Sums up the cloud in general, shirley?"

The Cloud (as a whole) gives you the tools to be as resilient as you want to be, but it sounds like a number of companies (including Amazon Music, which is where I saw issues) didn't architect things well. The loss of US-EAST-1 or any single AZ shouldn't break serious apps. If you're extra serious, you run in AWS and something else (Azure, GCP, etc.), and even on your own kit at something like Switch. Anything that breaks all of that at the same time probably means there's no working internet left for clients to notice your outage.
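The "as resilient as you want to be" approach comes down to keeping more than one independent deployment and sending traffic to the first healthy one. Below is a minimal sketch of just that decision logic, assuming each deployment exposes some health check. The deployment names and checks are hypothetical. A real setup would usually make this choice in DNS or a load balancer rather than in application code.

```python
from typing import Callable, Optional, Sequence, Tuple

# A deployment is a (name, health_check) pair. Names and checks are hypothetical.
Deployment = Tuple[str, Callable[[], bool]]


def pick_deployment(deployments: Sequence[Deployment]) -> Optional[str]:
    """Return the first deployment, in priority order, whose health check passes."""
    for name, is_healthy in deployments:
        try:
            if is_healthy():
                return name
        except Exception:
            # A check that raises is treated the same as a failed check.
            continue
    return None  # nothing is healthy: a full outage


if __name__ == "__main__":
    deployments = [
        ("aws-primary-region", lambda: False),   # e.g. the region that just fell over
        ("aws-secondary-region", lambda: True),
        ("other-provider", lambda: True),        # Azure, GCP, your own kit, ...
    ]
    print(pick_deployment(deployments))  # aws-secondary-region
```

The list order encodes preference. The hard and expensive part is keeping every entry in that list actually deployable and in sync.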

It's all about how much you want to spend on good IT people.

1. Monday 30th November 2020 18:50 GMT Mr.Nobody
  
  Re: Tradeoffs the punters can't control (and don't have the relevant decision making info anyway)
  
  While what you say is true, it eliminates many of the cost savings cloud does have to offer, namely using services instead of a server with an app on it (or many servers with many apps on them).
  
  Being cloud-vendor agnostic is extremely expensive. If you have complicated products, you now need experts in two or three cloud providers, and you need all the infrastructure pieces to make them work together.
  
Monday 30th November 2020 18:24 GMT yoganmahew

Re: Tradeoffs the punters can't control (and don't have the relevant decision making info anyway)

On average, the cloud providers have very good reliability...

Monday 30th November 2020 17:30 GMT Claptrap314

You're talking out the wrong end again

Four weeks ago, you sprinkled magic faerie dust over elections & proclaimed "it's time" for one of the worst ideas of all time. Now, you're saying...what exactly?

The reliability miracle that the pundits were proclaiming over the cloud was, of course, hype. What no one knew a decade ago was that AWS did not even have the right people in the room to deliver five nines. And believe me, neither does the average shop that's handling less than 1k queries per second (qps). That's almost everyone. Moreover, mixed environments create entirely new failure surfaces.

As for some meta-service, that's really just escalating hype. The cloud providers are HIGHLY motivated to make multi-cloud unnecessarily difficult. The only way I can see that changing is if their customers demand it. Think IBM demanding that Intel license the 286 to AMD. But Amazon, Google, and Microsoft are literally three of the four biggest companies in the world by market cap. NO ONE is going to be able to make them stabilize and synchronize their offerings enough that anyone riding on top won't be one API change away from zero availability.

Certainly, companies would like to have what the cloud hype proclaims. But that's going to require deeper changes than this article is even starting to address.

Tuesday 1st December 2020 23:10 GMT Mike 16

Five Nines

Long ago (well, about 15 years) and far away (Milpitas) I was "acquihired" (aka "Borged") by a major networking company.

"Onboarding" involved a pep talk by a high muckety-muck of the sales group, who pledged that we would deliver "nine fives" by the end of the year.

I glanced at a fellow bit of plankton and it was clear from their expression that they agreed: "Yep, that's about what this lot can hit"
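For scale, "five nines" means 99.999% availability, while the "nine fives" promised in that pep talk would be 55.5555555%. Below is a small sketch of the downtime arithmetic, assuming an average 365.25-day year.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes in an average year


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def repeated_digit(digit: int, count: int) -> float:
    """Turn e.g. (9, 5) into 0.99999 and (5, 9) into 0.555555555."""
    return float("0." + str(digit) * count)


if __name__ == "__main__":
    five_nines = repeated_digit(9, 5)
    nine_fives = repeated_digit(5, 9)
    print(f"five nines: {downtime_minutes_per_year(five_nines):.2f} min/year")        # 5.26
    print(f"nine fives: {downtime_minutes_per_year(nine_fives) / 60 / 24:.1f} days/year")  # 162.3
```

Five nines allows about 5.26 minutes of downtime a year. Nine fives allows about 162.3 days.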

1. Wednesday 9th December 2020 22:11 GMT Claptrap314
  
  Re: Five Nines
  
  Well, if they are monitoring their availability to the millisecond, that's a good start, at least!
  
Monday 30th November 2020 20:59 GMT Dinanziame

"Too bad they won't live – but then again, who does?"

I've seen what you did there... Seen things you people wouldn't believe... like tears in rain.

Tuesday 1st December 2020 08:26 GMT Anonymous South African Coward

We migrated our emails to the cloudy office hosted by Microsoft because of persistent DDoS attacks on our hosted Exchange.

Time will tell if we made the right choice or not.

I don't trust the cloud, as it is whimsical, like Mr Murphy, and will leave you hanging should you have a critical meeting/email/whatever...
