AWS reveals it broke itself by exceeding OS thread limits, sysadmins weren’t familiar with some workarounds

Amazon Web Services has revealed that adding capacity to an already complex system was the reason its US-EAST-1 region took an unplanned and rather inconvenient break last week. The short version of the story is that the company’s Kinesis service, which is used directly by customers and underpins other parts of AWS’ own …

COMMENTS

Monday 30th November 2020 05:30 GMT chuckufarley

I think they are Nerfing...

...the wrong object.

Single Points of Failure are BAD...unless you have Too Many Points of Failure. In that case the KISS dogma will never run over your karma.

As a private citizen that would be extremely put out if the entire world were to catch fire I cannot condone giving more control to a daemon that has fallen over so spectacularly. It's almost as if they fed SystemD'oh! steroids and expected good things to come of it.

5 16 Reply
1. Monday 30th November 2020 08:30 GMT Anonymous Coward
  
  Re: I think they are Nerfing...
  
  "Single Points of Failure are BAD...unless you have Too Many Points of Failure"
  
  Seems like they had both - they added servers that all needed to communicate with each other via micro-services (redundancy, yay!) but then each OS ran out of threads. The AWS explanation is quite well written and this all shows the complexity of scaling up...
  
  28 0 Reply
  1. Monday 30th November 2020 11:18 GMT Pascal Monett
    
    Agreed
    
    As much as I despise The Cloud (TM), I have to admit that the engineers working on it are really pushing the limits.
    
    Now, the amount of RAM on a server is no longer the problem - it's the amount of threads a CPU can handle that is.
    
    Wow. Is there anything we can't push to the brink ?
    
    14 3 Reply
    1. Monday 30th November 2020 12:44 GMT tip pc
      
      Re: Agreed
      
      "Now, the amount of RAM on a server is no longer the problem - it's the amount of threads a CPU can handle that is."
      
      It wasn't CPU threads but the number of threads the OS could handle.
      
      They stated they will amend the limitation in the OS and also use bigger servers instead of more servers.
      
      But yes still pushing limits of their cloud platform.
      
      18 0 Reply
    2. Monday 30th November 2020 13:33 GMT Cynic_999
      
      Re: Agreed
      
      ISTM that there is surely no inescapable need to have a separate thread for each server on the system. In fact doing so seems to me to be pretty inefficient. So instead of increasing the server capability, it would surely be better to re-write the method of operation so that it does not need to run so many separate threads - perhaps (as an off-the-cuff possibility) by having one thread that polls each server round-robin style rather than having a dedicated thread for each server.
      
      10 0 Reply
      1. Monday 30th November 2020 16:22 GMT MacroRodent
        
        Re: Agreed
        
        There is a well-known and widely used solution to this problem: thread pools at the userland level. Your threading library hands out fake threads that map to a smaller number of OS-supported threads. The fake thread may use a different real thread at different activations. This works because the threads usually sleep most of the time anyway if they have been spawned for communications purposes, and it is easier to dynamically allocate the required data for a large number of threads at userland level, than in the kernel.
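
A minimal sketch of the pooling idea, assuming Python's standard `concurrent.futures.ThreadPoolExecutor` as a stand-in. This is a plain fixed-size worker pool rather than a full "fake thread" library, and `run_peer_tasks`, `handle_peer` and the peer and worker counts are hypothetical names and numbers chosen for illustration, not anything from the AWS write-up.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_peer_tasks(num_peers: int, max_workers: int) -> tuple[list[str], int]:
    """Handle one task per peer using at most max_workers OS threads."""
    seen_threads: set[int] = set()
    lock = threading.Lock()

    def handle_peer(peer_id: int) -> str:
        # Record which worker thread picked up this peer's task.
        with lock:
            seen_threads.add(threading.get_ident())
        return f"peer-{peer_id}: ok"

    # The pool caps the number of OS threads no matter how many peers exist.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(handle_peer, range(num_peers)))
    return results, len(seen_threads)


results, threads_used = run_peer_tasks(num_peers=1000, max_workers=8)
print(len(results), "peer tasks handled by", threads_used, "worker threads")
```

The number of worker threads stays at or below `max_workers` however large `num_peers` gets, which is the property the per-server thread design lacks.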
        
        13 0 Reply
        
        Monday 30th November 2020 16:32 GMT TimMaher
        
        Re: Agreed
        
        Have to agree with both of you on this point.
        
        Using a thread per server is an odd design and it can be ameliorated by using thread pools, which have been around for a very long time.
        
        7 0 Reply
      2. Monday 30th November 2020 16:25 GMT Warm Braw
        
        Re: Agreed
        
        This.
        
        "Does not scale" leads directly to "does not compute" in such scenarios.
        
        I can't glean a great deal of useful insight from the AWS post, but it does seem that there's a kind of circular dependency: the scaling and provisioning depends on higher-level services that don't work when there are scaling and provisioning issues. Calling up the protocol stack is a risky business, because it inevitably calls right back down.
        
        8 0 Reply
  2. Monday 30th November 2020 18:19 GMT Anonymous Coward
    
    Re: I think they are Nerfing...
    
    Sounds like they had DBA and network engineer bods doing stuff that should have also involved OS engineers. Seems to me they didn't even consider the issue of OS limits and the fact that no matter how much you virtualise stuff and put hardware in load balanced clusters, ultimately it all runs on operating systems and hardware that has physical limitations. Sometimes I get the feeling people who should know better forget this rather important fact.
    
    But hey, it's Da Cloud! It's all magic and Just Works, right?
    
    7 0 Reply
    1. Monday 30th November 2020 18:32 GMT jake
      
      Re: I think they are Nerfing...
      
      "But hey, its Da Cloud! Its all magic and Just Works, right?"
      
      That's what Marketing told us, so it must be true!
      
      6 0 Reply
      1. Tuesday 1st December 2020 01:27 GMT Clunking Fist
        
        Re: I think they are Nerfing...
        
        Of course, otherwise they wouldn't have said it, Shirley?
        
        0 0 Reply
    2. Tuesday 1st December 2020 08:16 GMT mikepren
      
      Re: I think they are Nerfing...
      
      I think it's worse than that. I think their design is wrong, for massive scale. Status messages shouldn't need to be p2p, that's what you have topics and messaging for. In the days of on-prem app servers you used to have state replication like that (p2p) but as you scaled you moved to a different paradigm, like a central HA DB, or some broadcast technology.
      
      0 0 Reply
2. Monday 30th November 2020 11:08 GMT Crypto Monad
  
  Re: I think they are Nerfing...
  
  > Single Points of Failure are BAD
  
  There was no SPOF here, but a complex, highly-coupled system, which also happened to have an N² scaling issue (every node talks to every other node).
  
  22 0 Reply
  1. Monday 30th November 2020 11:44 GMT Doctor Syntax
    
    Re: I think they are Nerfing...
    
    Like the man said, too many points of failure.
    
    9 0 Reply
  2. Monday 30th November 2020 13:09 GMT Anonymous Coward
    
    Re: I think they are Nerfing...
    
    Absolutely, but it could be argued that the single point of failure is having a single design for each of the nodes, creating a mono-culture (it doesn't affect the reliability at a node level but affects the impact at a system-wide level).
    
    It also highlights the difficulties with realistic scaling in pre-deployment tests.
    
    2 1 Reply
Monday 30th November 2020 05:36 GMT amanfromMars 1

Re: Plan One

Plan one: use bigger servers.

Is that plan akin to building bigger aircraft carriers which just also creates a bigger and more vitally important to planned operations target whenever deployed in service of customer requests? Creating a Goliath for a David to vanquish via similar unexpectedly simple means?

Is not the problem still remaining to be solved one of secure and timely rapid communication across and between and within fleet servers of disruptive additions/problematic events which don't automatically, quite naturally, trigger panic overload conditions/systems meltdowns/command reorganisations?

The difficulty then to resolve, whenever something is automatically quite naturally triggered, is the answer is not natural and be of artificial and/or alien being/construction/phorm with that realisation further hindering necessary reform and preventing timely human resolution leading to greater outrages in future outages?

9 10 Reply
1. Monday 30th November 2020 05:39 GMT chuckufarley
  
  Re: Plan One
  
  Are you saying that AWS is a symptom of Humanity's Auto-Immunity Disorder?
  
  2 0 Reply
  1. Monday 30th November 2020 07:59 GMT amanfromMars 1
    
    Re: Plan One
    
    Are you saying that AWS is a symptom of Humanity's Auto-Immunity Disorder? .... chuckufarley
    
    No, although could/would it also be if not a result of Humanity's Auto-Immunity Disorder?
    
    1 3 Reply
2. Monday 30th November 2020 05:57 GMT mikepren
  
  Re: Plan One
  
  It's their immediate plan. It's going to take time to rearchitect from many to many to something more scalable, like a service mesh.
  
  2 1 Reply
  1. Monday 30th November 2020 06:56 GMT Anonymous Coward
    
    Re: Plan One
    
    Either you didn't read/understand what the problem was, or do not know what a service mesh is.
    
    5 0 Reply
3. Monday 30th November 2020 12:37 GMT MJB7
  
  Re: Plan One
  
  No. They are planning to shift from "many thousands of servers" to either "a few thousand servers" or "many hundreds of servers" - not "tens of servers".
  
  3 0 Reply
  1. Monday 30th November 2020 12:55 GMT Roland6
    
    Re: Plan One
    
    >No. They are planning to shift from "many thousands of servers" to either "a few thousand servers" or "many hundreds of servers"
    
    It does seem that Amazon have hit a horizontal scaling limit of a flat server net based on small increment (ie. single server) scaling. It would seem the natural solution would be to introduce some larger scaling unit; a natural unit would be to create a larger vserver that consists of hundreds/thousands of individual servers and adapt their service/server management strategy accordingly.
    
    Also it would seem that some form of broker service is going to be needed. Both service orchestration patterns have been tried and tested over the decades, and just need to be adapted for the cloud.
    
    4 0 Reply
4. Tuesday 1st December 2020 12:54 GMT Wayland
  
  Re: Plan One
  
  Throwing better hardware at the problem is a sensible quick fix. However, tuning the software is probably what's needed. The National Grid has similar problems with balancing its load. Excellent in theory but seems to have a mind of its own at the large scale.
  
  0 1 Reply
Monday 30th November 2020 06:26 GMT jake

From the Redmond school of repair.

"Turn it off and then back on again."

Would YOU trust these numpties with your corporate data?

13 16 Reply
1. Monday 30th November 2020 11:42 GMT Graham 32
  
  Re: From the Redmond school of repair.
  
  How would YOU return a failed system to a known state?
  
  11 0 Reply
  1. Monday 30th November 2020 18:27 GMT jake
    
    Re: From the Redmond school of repair.
    
    What failed system? My system has been running non-stop since January 1st, 1981 ... and the only reason it went down then was because I decided to reboot everything and come up from scratch during the world-wide NCP to TCP/IP transition.
    
    0 7 Reply
Monday 30th November 2020 06:32 GMT davef1010101010

Translation - They built a "Cloud of Cards" ?

And the cards got wet!

15 0 Reply
Monday 30th November 2020 08:39 GMT Greybearded old scrote

Foot, meet shotgun

When every server talks to every other server, then exponential combinations are not your friend.

Why no schadenfreude icon?
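
A quick back-of-envelope sketch of the growth, assuming the design described in the AWS summary quoted elsewhere in this thread (each front-end server creates one thread for each of the other servers). Strictly speaking this is quadratic rather than exponential growth, which is bad enough. The fleet sizes are illustrative guesses, not AWS figures.

```python
def threads_per_server(n: int) -> int:
    """Threads one server needs to talk to every other server."""
    return n - 1


def total_threads(n: int) -> int:
    """Peer threads across the whole fleet: n servers, n - 1 threads each."""
    return n * (n - 1)


for n in (10, 100, 1_000, 10_000):
    print(
        f"{n:>6} servers: {threads_per_server(n):>6} threads each, "
        f"{total_threads(n):>12,} fleet-wide"
    )
```

The per-server figure is the one that runs into an OS thread limit: it grows by one for every server added anywhere in the fleet.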

18 0 Reply
1. Monday 30th November 2020 14:14 GMT Muscleguy
  
  Re: Foot, meet shotgun
  
  Even the brain doesn’t do that. Childhood is much more interconnected than in adult brains but links get pruned, or should do. Synaesthesia results from a lack of normal pruning. Kids are naturally synaesthetic to some degree. Also have very low impulse control and are prone to tantrums.
  
  The stability of adult brains points to a distributed node model. There are orders of magnitude more connections in a brain than in any server farm. There are issues with always on, system malfunctions leading to spurious errors (hallucinations) and even system death result from lack of appropriate downtime.
  
  Silicon may avoid the downtime issue since it’s due to the need to take out the trash, metabolites being dumped into the cerebro-spinal fluid which then drains into the lymph, none of which really applies to silicon.
  
  Surely the internet architecture of nodes points the way to how to structure large server farms?
  
  5 1 Reply
Monday 30th November 2020 09:38 GMT Piro

Potential enormous boost for AMD

There's nowhere else they can get very high core count cpus that still support multiple sockets

5 4 Reply
1. Monday 30th November 2020 23:49 GMT doublelayer
  
  Re: Potential enormous boost for AMD
  
  Only if they find three things: a) they can't get around the one thread per server thing, b) all their threads are putting too much pressure on the CPU, not just the OS's limits, and c) they still can't get around the one thread per server thing. So far, when they increase the CPU power available to the VMs, it's so they can reduce the total number of them rather than to get more concentrated compute. They could solve problem A by using a system that allocates threads to access requests rather than reserving one per server. Having done that, it seems unlikely that they'd experience problem B at all, based on their statements. If they did, they could always try to solve problem C by redesigning the system so it doesn't have quadratic scaling, for example by having certain nodes whose responsibilities are to contact subsets of the servers and keep that data available for servers in other zones. If all of those attempts fail, then AMD may have a cause to celebrate.
  
  2 0 Reply
Monday 30th November 2020 09:47 GMT rg287

So they're saying the cloud isn't infinitely scalable?

Who knew! ¯\_(ツ)_/¯

25 0 Reply
1. Monday 30th November 2020 10:09 GMT amanfromMars 1
  
  Who Knows?
  
  So they're saying the cloud isn't infinitely scalable? ..... rg287
  
  Surely it's much more a case of their saying the cloud isn't infinitely scalable safely without the odd occasional problematic security issue?
  
  0 4 Reply
  1. Monday 30th November 2020 11:07 GMT parperback parper
    
    Re: Who Knows?
    
    No matter how big your cloud, N-squared will kill you eventually.
    
    In this case number of front end server communication links.
    
    12 0 Reply
2. Monday 30th November 2020 16:17 GMT TRT
  
  The cloud is infinitely scalable, but you occasionally have to reboot not just the sky, but the entire universe.
  
  3 0 Reply
Monday 30th November 2020 10:06 GMT xyz

Anyone remember 9/11

When everyone had to drop back to static pages due to load. Don't suppose anyone on the BBC/AWS team was born then. Anyway, this is what worries me about the cloud, when the shit hits the fan your fallback options are zero.

10 0 Reply
1. Monday 30th November 2020 13:03 GMT A.P. Veening
  
  Re: Anyone remember 9/11
  
  Not everyone, the Fark-forum stayed up with running commentaries from eyewitnesses and there were also URLs which still showed dynamic content including live views on cnn.com (yes, I was in Europe and saw the second tower come down in a live stream while most of Europe wasn't even aware the first had come down).
  
  2 0 Reply
2. Monday 30th November 2020 18:18 GMT Pascal Monett
  
  Re: Anyone remember 9/11
  
  Indeed, and I will never stop saying that if your own server falls over, it's only you and your customers that are impacted. You have mitigation options, if you care to put the money on the table.
  
  When The Cloud(TM) falls over, it's you and millions of other people that are impacted, and the only thing you can do is sit and wait until Someone Else's Server comes back online.
  
  I do not see that as an advantage for any company that is serious about making money.
  
  4 0 Reply
  1. Monday 30th November 2020 18:36 GMT jake
    
    Re: Anyone remember 9/11
    
    "I do not see that as an advantage for any company that is serious about making money."
    
    Indeed. Today it would seem that companies aren't all that interested in making their investors a long-term profit, rather they are interested in baffling their investors with ~~bullshit~~ buzzwords.
    
    1 0 Reply
    1. Tuesday 1st December 2020 06:53 GMT amanfromMars 1
      
      Re: Anyone remember 9/11
      
      Indeed. Today it would seem that companies aren't all that interested in making their investors a long-term profit, rather they are interested in baffling their investors with ~~bullshit~~buzzwords. ...... jake
      
      You might like to realise that is the surreal state of markets today, jake. and is what is inventing market floats on the up side, and trying to keep afloat leaden and laden with crushing debt unicorn and zombie companies alike on the down side ...... with the biggest fools in present creation imagining it desirable and sustainable and never likely to flash crash catastrophically on their watch/in their life time ..... and just kicking the can of worms problem on down the road for their children and their childrens' children to suffer dealing with. Some parents those, eh? ...... Abuse doesn't even begin to describe the depravity of the activities perpetrated and perversely justified in practically all cases as being their legacy and for their own future good.
      
      :-) Just imagine, what are the chances of there being a long term catastrophic global pandemic emergency and the markets rising and being stronger and more valuable/valued than ever before whenever billions of folk are being denied even their basic necessary for living needs. It just wouldn't be accepted in reality, would it. The markets and that reality would be recognised as definitely bound to be rigged and a false economy in the thrall of right shady and shadowy non-government and non-state actors taking everybody else on the planet for useful fools on a useless ride/helter-skelter journey.
      
      Did y'all not get the earlier memo on the situation revealing the enigmatic conundrum and disease which eats you for its pleasures........ Network 1976 "The World is a business" GOD Speech scene Do you want it/need it in plain text/black and white too? That's not a problem and easily done if you do.
      
      0 1 Reply
  2. Tuesday 1st December 2020 08:39 GMT jmch
    
    Re: Anyone remember 9/11
    
    As a business owner, if I have my own servers it's up to my IT guys to keep it up, and restore it quickly if it falls over. If I'm using cloud, it's up to AWS or whoever's IT guys.
    
    I understand that good IT guys want to keep control, having confidence that they can have better uptime / easier resolution than with cloud. But not all businesses have good enough IT guys.
    
    So of course some business owners will think hey, maybe AWS or whoever will do a better job, or good enough job at a lower cost. And in many cases they would be right
    
    1 0 Reply
    1. Tuesday 1st December 2020 08:48 GMT jake
      
      Re: Anyone remember 9/11
      
      If it's in house, YOU have total control when it goes TITSUP on Friday afternoon. If it's on AWS, their IT guys might get around to giving a fuck on Monday. Or perhaps Tuesday. Maybe.
      
      2 0 Reply
Monday 30th November 2020 10:32 GMT six_tymes

meh, this time I don't believe their excuse. they got hacked and won't admit it. lol

0 12 Reply
1. Monday 30th November 2020 11:42 GMT six_tymes
  
  a lot of aws workers here I see. "ha ha"
  
  0 6 Reply
2. Tuesday 1st December 2020 12:54 GMT Wayland
  
  We may not love The Cloud but we worship the man in the sky behind the cloud.
  
  0 1 Reply
  1. Tuesday 1st December 2020 17:17 GMT jake
    
    We?
    
    Who is "we", Kemosabe?
    
    0 0 Reply
Monday 30th November 2020 11:18 GMT Anonymous Coward

Async

Using a worker thread per neighbour server - is this '90s programming? Each thread typically reserving a couple of megs of stack space, my guess only getting occasional messages. 12 hours to restore the system by adding 'a few', say 3, hundred servers at a time, so about 4,000 servers. An async/event driven model would likely scale better...
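
A minimal sketch of the event-driven alternative, assuming Python's `asyncio` and using `asyncio.sleep` as a stand-in for waiting on a peer's message. `talk_to_peer` and `poll_all` are hypothetical names, and 4,000 is just the guess from the comment above.

```python
import asyncio
import threading


async def talk_to_peer(peer_id: int) -> str:
    # Stand-in for waiting on an occasional network message from one peer.
    await asyncio.sleep(0.01)
    return f"peer-{peer_id}"


async def poll_all(num_peers: int) -> list[str]:
    # One coroutine per peer, all multiplexed on the single event-loop thread.
    return await asyncio.gather(*(talk_to_peer(i) for i in range(num_peers)))


replies = asyncio.run(poll_all(4000))
print(len(replies), "peers served;", threading.active_count(), "OS thread(s) in this process")
```

Each idle peer costs a small coroutine object instead of a reserved thread stack, so the thread count no longer depends on the fleet size.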

13 0 Reply
1. Monday 30th November 2020 18:11 GMT Anonymous Coward
  
  Re: Async
  
  Millennial coding.
  
  6 2 Reply
  1. Monday 30th November 2020 20:53 GMT disk iops
    
    Re: Async
    
    that's who they hire by the bucket-load and most of them are H1B at that. Did you honestly think they had actual CS degrees and wouldn't design something so stupid?
    
    The only way forward is to use key-partitioning (like S3 does) and stop being so damn cheap about refusing to use load-balancers since they have their own in-house design for Pete's sake and don't have to pay Citrix for their NetScalers anymore.
    
    I don't remember how fast the S3 infra converges to a single-system-image, but S3 has 3 distinct tiers for starters and about 350,000 servers globally that need to eventually register and share 'knowledge' about their peers. If Kinesis is not using the correct/latest 'chatter' protocol to discover its swarm, they are fracking idiots.
    
    3 1 Reply
Monday 30th November 2020 11:19 GMT Elledan

Don't just throw more hardware at it

From the autopsy report it sounds like they A) built a really complicated structure ('shards', 'streams') with countless threads to communicate between nodes instead of a single (or pool of) comms thread(s), and B) committed the cardinal sin of not having error detection and graceful degradation.

While A) isn't an issue by itself, it made B) a lot worse. The real fix here would be to fix the issues in B), but it doesn't sound like they're doing that. Probably because writing good software and testing it under various scenarios costs time and money.

In short, they'll very likely be back at this exact same meltdown scenario in a matter of months or years.

11 1 Reply
1. Monday 30th November 2020 12:49 GMT fajensen
  
  Re: Don't just throw more hardware at it
  
  A) isn't an issue by itself, i
  
  I think it is *exactly* the issue: AWS making their services insanely complicated may help sell some "Amazon Cloud University"-tickets à la Cisco, but it also backfires because now your own staff needs to be trained for a significant part of the time they spend at their desks and when something blows up, it is not readily apparent what precisely has blown or how to fix it.
  
  Trying to limit AWS access rights for an AWS Lambda service so that Russian Hax0rs and internet randos cannot blow up one's credit card or the service is absolutely not a trivial exercise. It is not surprising that people leave AWS services open all of the time; the working examples in the AWS docs are all of the "chmod 777"-kind and it is head-explodingly difficult to get to a different state.
  
  6 0 Reply
  1. Monday 30th November 2020 20:55 GMT disk iops
    
    Re: Don't just throw more hardware at it
    
    because the people who write the code are your typical programmer who doesn't understand permissions or access rights - ie. has ZERO hands-on systems admin experience. So of course, 777 is the answer because that's how it has to be on their local laptop.
    
    0 0 Reply
    1. Tuesday 1st December 2020 06:23 GMT Anonymous Coward
      
      Re: Don't just throw more hardware at it
      
      They are admin on their Windows laptop and run the process as their admin user, so the chmod 777 makes zero sense to them. Luckily all developers are now operations DevOps bros and all the admins have been fired, because the developer can use an abstracted Terraform program to do chmod 777 in his container and no one will ever know, especially not the developer.
      
      0 0 Reply
Monday 30th November 2020 11:21 GMT Mike 125

same old

1: Not enough threads?

2: Configure more threads

3: backto 1

----------------------

1: Too many cars?

2: Build more roads

3: backto 1

-----------------------

1: Not enough cheap meat?

2: Burn the Amazon for more cows (see what I did there)

3: backto 1

14 4 Reply
1. Monday 30th November 2020 16:19 GMT TRT
  
  Re: same old
  
  I particularly like the third one after they roasted Bezos in Spitting Image:10, which has to be one of the best skits ever.
  
  0 0 Reply
Monday 30th November 2020 13:22 GMT Martin hepworth

Open on failure

Would love to see ANY other cloud provider be this open this quick....

Stuff happens it's how you handle it that matters. Come on Azure/GCP we're looking at you

1 3 Reply
Monday 30th November 2020 13:39 GMT Anonymous Coward

Presenting the "Well-Architected Framework"

Proprietary many-to-many connections for edge cluster data sharing? Looks like they're not using their own cache product pills. Now we know.

0 0 Reply
1. Monday 30th November 2020 14:29 GMT Muscleguy
  
  Re: Presenting the "Well-Architected Framework"
  
  Maybe with their insider knowledge they know the offering is pants?
  
  3 0 Reply
Monday 30th November 2020 15:53 GMT andy 103

Isn't the point of cloud infrastructure that you can do more with less?

That's funny because most cloud providers including AWS are telling everyone the main point of their existence (as far as customers are concerned) is that you can do everything using fewer resources, therefore costing you less.

Except when it goes wrong. Then you need more resources to fix the problem.

Almost as if you hadn't bothered with any of it, everything would have been better.

5 1 Reply
Monday 30th November 2020 17:52 GMT Martin

And this is going to happen over and over. The old guys, like me, who worked with resource-limited systems and invented loads of neat tricks (which now should be standard patterns, in fact) are now getting old or retired. The young kids, straight out of college, don't realize that just because you've got scads of memory and CPU, doesn't mean you still can't run out of resources - and they implement something like this.

No error checking? No warning that the number of threads is getting too high? No comms thread pool, as someone else suggested? This was a system where someone just added a bit more capacity and suddenly the whole thing falls over?

Well, give them their due for admitting to it. But still, this is actually dreadful. Whoever let that design go live should be strung up.
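
For what the missing warning might look like, here is a rough sketch of a pre-flight check run before capacity is added. It assumes the one-thread-per-other-server design described in the AWS summary; `check_thread_budget`, `base_threads`, the 80% warning ratio and every number below are hypothetical.

```python
def check_thread_budget(
    fleet_size: int,
    base_threads: int,
    os_thread_limit: int,
    warn_ratio: float = 0.8,
) -> str:
    """Classify the projected per-server thread count against an OS limit."""
    # One thread per other server, on top of whatever the process already uses.
    projected = base_threads + (fleet_size - 1)
    if projected > os_thread_limit:
        return "refuse"
    if projected > warn_ratio * os_thread_limit:
        return "warn"
    return "ok"


for fleet in (5_000, 8_500, 10_000):
    verdict = check_thread_budget(fleet, base_threads=200, os_thread_limit=10_000)
    print(f"fleet of {fleet:>6}: {verdict}")
```

A check like this only helps if it runs as part of the change that adds servers, so the deployment stops before the fleet crosses the limit rather than after.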

9 0 Reply
1. Monday 30th November 2020 18:09 GMT TeeCee
  
  I agree:
  
  ...to do so create new threads for each of the other servers...
  
  That should have raised a red flag in the bloody design phase for something that was always intended to scale across a metric fucktonne of servers.
  
  6 0 Reply
Monday 30th November 2020 18:08 GMT Claptrap314

n^2? Are you #*$&ing kidding me?

Having learned SRE by supporting Hangouts at Google, I know exactly why they have the front end servers all talking to each other. It does not matter. IT. CANNOT. SCALE. Software engineers completely rearchitect systems rather than implement n^2. If we cannot figure out how to do it ourselves, we ask for help. If that means bringing in a computer scientist, that's what we do. I'm not saying I've never delivered n^2 into prod. I'm saying I never delivered it into prod where scale would ever be a concern.

The worst part about it is this: "To speed restart, in parallel with our investigation, we began adding a configuration to the front-end servers to obtain data directly from the authoritative metadata store rather than from front-end server neighbors during the bootstrap process." In other words, Amazon switched to an n log n solution in order to dig themselves out of the hole, but then immediately went back to the old way. Brilliant.

Sure, cells will help. Sure, expanding the number of threads the OS supports will help. Now, where do you put the hard limit on the number of servers in the cell based on the number of OS threads so that no human overrides or changes it? No. Bad architect. No more colored markers. Fix the n^2 problem.

But, as is often the case, this jape is not over. Their systems, nearly overloading from processing new configuration information, were trying to handle customer traffic--and overloading completely. Let me let you in on a secret: "My job is to keep the network up. It's merely out of my good graces that I allow customers on at all." If your servers are returning 100% 500s while keeping themselves and the network healthy, that can be recovered quickly. If they fall over, or the network does, that is BAD. Really, really bad. Design your systems to drop 100% of traffic before they fall over.
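
A small sketch of that "drop traffic before you fall over" idea, assuming a simple cap on in-flight requests. `LoadShedder`, `max_in_flight` and the use of HTTP 503 are illustrative choices, not how AWS or Google actually do it.

```python
import threading
from typing import Callable


class LoadShedder:
    """Admit at most max_in_flight requests; reject the rest immediately."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, work: Callable[[], object]) -> tuple[int, object]:
        # Non-blocking acquire: if no slot is free, shed instead of queueing.
        if not self._slots.acquire(blocking=False):
            return 503, "overloaded, retry later"
        try:
            return 200, work()
        finally:
            self._slots.release()


shedder = LoadShedder(max_in_flight=1)


def first_request() -> object:
    # While this request holds the only slot, a second one arrives and is shed.
    status_of_second, _ = shedder.handle(lambda: "second")
    return status_of_second


first_status, second_status = shedder.handle(first_request)
third_status, third_body = shedder.handle(lambda: "third")
print(first_status, second_status, third_status)
```

The rejected request costs almost nothing to refuse, so the server keeps enough headroom to stay healthy and recovers as soon as a slot frees up.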

Moving the critical services to dedicated servers is part of how you do that--that's a good move. But only part of it. As Google recently found out, you need a strict hierarchy for traffic. Configuration changes over everything. Critical logs next--but even here, you require rollups & squelching. Server fleet health traffic next. Then you get into general status & servicing the customer.
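
The hierarchy above can be sketched as a priority queue, assuming the four traffic classes named in the paragraph. `PriorityInbox`, the class names in `PRIORITY` and the sample payloads are all made up for illustration.

```python
import heapq
from itertools import count

# Lower number = served first, following the order given above.
PRIORITY = {"config": 0, "critical_log": 1, "fleet_health": 2, "customer": 3}


class PriorityInbox:
    """Serve messages strictly by traffic class, FIFO within a class."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._seq = count()  # tie-breaker keeps arrival order within a class

    def put(self, kind: str, payload: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), payload))

    def get(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


inbox = PriorityInbox()
inbox.put("customer", "GET /stream")
inbox.put("fleet_health", "heartbeat")
inbox.put("config", "new shard map")
inbox.put("critical_log", "disk error")
inbox.put("customer", "PUT /record")

served = [inbox.get() for _ in range(len(inbox))]
print(served)
```

Customer traffic is served last here even though it arrived first, which is the trade-off the comment argues for: the control plane stays responsive while the data plane waits or is shed.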

The other issue, and Amazon appears to be feeling its way in this direction, is that you MUST have a clear understanding of your dependencies, and systems in place to handle the failures of these dependencies. Throwing capacity at a problem does not make it go away--it makes the eventual failure that much more spectacular.

10 0 Reply
1. Monday 30th November 2020 21:25 GMT disk iops
  
  Re: n^2? Are you #*$&ing kidding me?
  
  The S3 Index tier is a close analog. When we had to mass-bootstrap the tier the nodes fetched their config from a 'static' source before they fell back to 'chatter/gab' mode to converge. I can't remember how we partitioned node sets but the respective 'master's eventually got their immediate peers all registered and sent updates upstream to other cells and eventually every cell got wind of all the other cells. But we sure as HELL did not maintain N^2 active connections!!
  
  This was SOLVED 10 years ago by the S3 team (and probably the EBS team). Kinesis apparently didn't bother to avail themselves of the existing codebase.
  
  This is not an uncommon occurrence at AWS - the teams don't talk and apparently Jassy and his minions haven't beaten the individual service teams with the "REUSE THE GOD DAMN CODE!" hammer enough.
  
  4 0 Reply
Monday 30th November 2020 18:18 GMT IGnatius T Foobar !

Should be obvious...

AWS is too big, too dominant, and is AS A WHOLE a single point of failure. Use other cloud providers.

4 0 Reply
Monday 30th November 2020 18:22 GMT bazza

Profligacy?

One thread per server sounds, well, excessive...

0 0 Reply
1. Monday 30th November 2020 21:24 GMT Imhotep
  
  Re: Profligacy?
  
  But it's just one thread. Should be fine.
  
  *Well, except for that per server part, and everything going down.
  
  2 0 Reply
