AWS reveals it broke itself by exceeding OS thread limits, sysadmins weren’t familiar with some workarounds

Amazon Web Services has revealed that adding capacity to an already complex system was the reason its US-EAST-1 region took an unplanned and rather inconvenient break last week. The short version of the story is that the company’s Kinesis service, which is used directly by customers and underpins other parts of AWS’ own …


  1. chuckufarley Silver badge

    I think they are Nerfing...

    ...the wrong object.

    Single Points of Failure are BAD...unless you have Too Many Points of Failure. In that case the KISS dogma will never run over your karma.

    As a private citizen who would be extremely put out if the entire world were to catch fire, I cannot condone giving more control to a daemon that has fallen over so spectacularly. It's almost as if they fed SystemD'oh! steroids and expected good things to come of it.

    1. Anonymous Coward
      Anonymous Coward

      Re: I think they are Nerfing...

      "Single Points of Failure are BAD...unless you have Too Many Points of Failure"

      Seems like they had both - they added servers that all needed to communicate with each other via micro-services (redundancy, yay!) but then each OS ran out of threads. The AWS explanation is quite well written and this all shows the complexity of scaling up...

      1. Pascal Monett Silver badge


        As much as I despise The Cloud (TM), I have to admit that the engineers working on it are really pushing the limits.

        Now, the amount of RAM on a server is no longer the problem - it's the number of threads a CPU can handle that is.

        Wow. Is there anything we can't push to the brink?

        1. tip pc Silver badge

          Re: Agreed

          "Now, the amount of RAM on a server is no longer the problem - it's the amount of threads a CPU can handle that is."

          It wasn't CPU threads but the number of threads the OS could handle.

          They stated they will amend the limitation in the OS and also use bigger servers instead of more servers.

          But yes, still pushing the limits of their cloud platform.

        2. Cynic_999 Silver badge

          Re: Agreed

          ISTM that there is surely no inescapable need to have a separate thread for each server on the system. In fact doing so seems to me to be pretty inefficient. So instead of increasing the server capability, it would surely be better to re-write the method of operation so that it does not need to run so many separate threads - perhaps (as an off-the-cuff possibility) by having one thread that polls each server round-robin style rather than having a dedicated thread for each server.
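The off-the-cuff suggestion above can be sketched in a few lines. This is hypothetical Python for illustration only — `poll_round_robin` and the fake status checks are invented names, not anything from AWS's write-up:

```python
def poll_round_robin(servers, rounds=1):
    """Poll every peer in turn from a single loop (one OS thread),
    instead of dedicating an OS thread to each server."""
    statuses = {}
    for _ in range(rounds):
        for name, check in servers.items():
            statuses[name] = check()  # assumed cheap, non-blocking status check
    return statuses

# Usage with fake peers whose status check always reports "ok":
servers = {f"server-{i}": (lambda: "ok") for i in range(4)}
print(poll_round_robin(servers))
```

The trade-off is latency: one slow peer delays everyone behind it in the round, which is one reason real systems tend to prefer thread pools or async I/O over a strict round-robin.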

          1. MacroRodent Silver badge

            Re: Agreed

            There is a well-known and widely used solution to this problem: thread pools at the userland level. Your threading library hands out fake threads that map to a smaller number of OS-supported threads. The fake thread may use a different real thread at different activations. This works because the threads usually sleep most of the time anyway if they have been spawned for communications purposes, and it is easier to dynamically allocate the required data for a large number of threads at userland level, than in the kernel.
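The thread-pool idea above can be shown with Python's standard library pool (illustrative only, not AWS's code): many logical tasks are multiplexed onto a handful of real OS threads.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def contact_peer(peer):
    # Each "logical thread" is just a task; it borrows one of the
    # pool's few real OS threads for the duration of its work.
    return (peer, threading.current_thread().name)

peers = [f"peer-{i}" for i in range(100)]
with ThreadPoolExecutor(max_workers=4) as pool:  # 100 tasks, at most 4 OS threads
    results = list(pool.map(contact_peer, peers))

os_threads = {name for _, name in results}
print(f"{len(results)} tasks ran on {len(os_threads)} OS thread(s)")
```

Because the tasks sleep (or wait on I/O) most of the time, a small pool services a large fleet of logical workers without ever approaching the kernel's thread limit.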

            1. TimMaher Silver badge
              Thumb Up

              Re: Agreed

              Have to agree with both of you on this point.

              Using a thread per server is an odd design and it can be ameliorated by using thread pools, which have been around for a very long time.

          2. Warm Braw Silver badge

            Re: Agreed


            "Does not scale" leads directly to "does not compute" in such scenarios.

            I can't glean a great deal of useful insight from the AWS post, but it does seem that there's a kind of circular dependency: the scaling and provisioning depends on higher-level services that don't work when there are scaling and provisioning issues. Calling up the protocol stack is a risky business, because it inevitably calls right back down.

      2. boltar Silver badge

        Re: I think they are Nerfing...

        Sounds like they had DBA and network engineer bods doing stuff that should have also involved OS engineers. Seems to me they didn't even consider the issue of OS limits and the fact that no matter how much you virtualise stuff and put hardware in load balanced clusters, ultimately it all runs on operating systems and hardware that has physical limitations. Sometimes I get the feeling people who should know better forget this rather important fact.

        But hey, it's Da Cloud! It's all magic and Just Works, right?

        1. jake Silver badge

          Re: I think they are Nerfing...

          "But hey, its Da Cloud! Its all magic and Just Works, right?"

          That's what Marketing told us, so it must be true!

          1. Clunking Fist Bronze badge

            Re: I think they are Nerfing...

            Of course, otherwise they wouldn't have said it, Shirley?

        2. mikepren

          Re: I think they are Nerfing...

          I think it's worse than that. I think their design is wrong for massive scale. Status messages shouldn't need to be p2p; that's what you have topics and messaging for. In the days of on-prem app servers you used to have state replication like that (p2p), but as you scaled you moved to a different paradigm, like a central HA DB or some broadcast technology.

    2. Crypto Monad

      Re: I think they are Nerfing...

      > Single Points of Failure are BAD

      There was no SPOF here, but a complex, highly-coupled system, which also happened to have an N² scaling issue (every node talks to every other node).
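The N² point is easy to make concrete. In a full mesh where every node keeps a link (and, per the AWS report, a thread) open to every other node, the directed link count is n(n−1):

```python
def mesh_links(n):
    """Directed links in a full mesh: each of n nodes talks to
    the other n - 1, so n * (n - 1) links -- O(N^2) growth."""
    return n * (n - 1)

for n in (10, 100, 1000, 10000):
    print(f"{n:>6} nodes -> {mesh_links(n):>12,} links")
```

Growing a fleet tenfold multiplies the links (and thread count) by roughly a hundred, which is how a fleet quietly sails past an OS thread limit that looked comfortably distant at the old size.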

      1. Doctor Syntax Silver badge

        Re: I think they are Nerfing...

        Like the man said, too many points of failure.

      2. Anonymous Coward
        Anonymous Coward

        Re: I think they are Nerfing...

        Absolutely, but it could be argued that the single point of failure is having a single design for each of the nodes, creating a mono-culture (it doesn't affect the reliability at a node level but affects the impact at a system-wide level).

        It also highlights the difficulties with realistic scaling in pre-deployment tests.

  2. amanfromMars 1 Silver badge

    Re: Plan One

    Plan one: use bigger servers.

    Is that plan akin to building bigger aircraft carriers which just also creates a bigger and more vitally important to planned operations target whenever deployed in service of customer requests? Creating a Goliath for a David to vanquish via similar unexpectedly simple means?

    Is not the problem still remaining to be solved one of secure and timely rapid communication across and between and within fleet servers of disruptive additions/problematic events which don't automatically, quite naturally, trigger panic overload conditions/systems meltdowns/command reorganisations?

    The difficulty then to resolve, whenever something is automatically quite naturally triggered, is the answer is not natural and be of artificial and/or alien being/construction/phorm with that realisation further hindering necessary reform and preventing timely human resolution leading to greater outrages in future outages?

    1. chuckufarley Silver badge

      Re: Plan One

      Are you saying that AWS is a symptom of Humanity's Auto-Immunity Disorder?

      1. amanfromMars 1 Silver badge

        Re: Plan One

        Are you saying that AWS is a symptom of Humanity's Auto-Immunity Disorder? .... chuckufarley

        No, although could/would it also be if not a result of Humanity's Auto-Immunity Disorder?

    2. mikepren

      Re: Plan One

      It's their immediate plan. It's going to take time to rearchitect from many-to-many to something more scalable, like a service mesh.

      1. Anonymous Coward
        Anonymous Coward

        Re: Plan One

        Either you didn't read/understand what the problem was, or do not know what a service mesh is.

    3. MJB7

      Re: Plan One

      No. They are planning to shift from "many thousands of servers" to either "a few thousand servers" or "many hundreds of servers" - not "tens of servers".

      1. Roland6 Silver badge

        Re: Plan One

        >No. They are planning to shift from "many thousands of servers" to either "a few thousand servers" or "many hundreds of servers"

        It does seem that Amazon have hit a horizontal scaling limit of a flat server net based on small-increment (i.e. single server) scaling. It would seem the natural solution would be to introduce some larger scaling unit; a natural unit would be to create a larger vserver that consists of hundreds/thousands of individual servers and adapt their service/server management strategy accordingly.

        Also it would seem that some form of broker service is going to be needed. Both service orchestration patterns have been tried and tested over the decades, and just need to be adapted for the cloud.

    4. Wayland Bronze badge

      Re: Plan One

      Throwing better hardware at the problem is a sensible quick fix. However, tuning the software is probably what's needed. The National Grid has similar problems with balancing its load. Excellent in theory, but it seems to have a mind of its own at large scale.

  3. jake Silver badge

    From the Redmond school of repair.

    "Turn it off and then back on again."

    Would YOU trust these numpties with your corporate data?

    1. Graham 32

      Re: From the Redmond school of repair.

      How would YOU return a failed system to a known state?

      1. jake Silver badge

        Re: From the Redmond school of repair.

        What failed system? My system has been running non-stop since January 1st, 1981 ... and the only reason it went down then was because I decided to reboot everything and come up from scratch during the world-wide NCP to TCP/IP transition.

  4. davef1010101010

    Translation - They built a "Cloud of Cards" ?

    And the cards got wet!

  5. Greybearded old scrote Silver badge

    Foot, meet shotgun

    When every server talks to every other server, the quadratic growth in connections is not your friend.

    Why no schadenfreude icon?

    1. Muscleguy Silver badge

      Re: Foot, meet shotgun

      Even the brain doesn't do that. Childhood brains are much more interconnected than adult brains, but links get pruned, or should do. Synaesthesia results from a lack of normal pruning. Kids are naturally synaesthetic to some degree. They also have very low impulse control and are prone to tantrums.

      The stability of adult brains points to a distributed node model. There are orders of magnitude more connections in a brain than in any server farm. There are issues with always on, system malfunctions leading to spurious errors (hallucinations) and even system death result from lack of appropriate downtime.

      Silicon may avoid the downtime issue, since in the brain it's due to the need to take out the trash: metabolites being dumped into the cerebro-spinal fluid, which then drains into the lymph, none of which really applies to silicon.

      Surely the internet architecture of nodes points the way to how to structure large server farms?

  6. Piro

    Potential enormous boost for AMD

    There's nowhere else they can get very high core count CPUs that still support multiple sockets.

    1. doublelayer Silver badge

      Re: Potential enormous boost for AMD

      Only if they find three things: a) they can't get around the one thread per server thing, b) all their threads are putting too much pressure on the CPU, not just the OS's limits, and c) they still can't get around the one thread per server thing. So far, when they increase the CPU power available to the VMs, it's so they can reduce the total number of them rather than to get more concentrated compute. They could solve problem A by using a system that allocates threads to access requests rather than reserving one per server. Having done that, it seems unlikely that they'd experience problem B at all, based on their statements. If they did, they could always try to solve problem C by redesigning the system so it doesn't have quadratic scaling, for example by having certain nodes whose responsibilities are to contact subsets of the servers and keep that data available for servers in other zones. If all of those attempts fail, then AMD may have a cause to celebrate.

  7. rg287 Silver badge

    So they're saying the cloud isn't infinitely scalable?

    Who knew! ¯\_(ツ)_/¯

    1. amanfromMars 1 Silver badge

      Who Knows?

      So they're saying the cloud isn't infinitely scalable? ..... rg287

      Surely it's much more a case of their saying the cloud isn't infinitely scalable safely without the odd occasional problematic security issue?

      1. parperback parper

        Re: Who Knows?

        No matter how big your cloud, N-squared will kill you eventually.

        In this case, the number of front-end server communication links.

    2. TRT Silver badge

      The cloud is infinitely scalable, but you occasionally have to reboot not just the sky, but the entire universe.

  8. xyz

    Anyone remember 9/11

    When everyone had to drop back to static pages due to load. Don't suppose anyone on the BBC/AWS team was born then. Anyway, this is what worries me about the cloud, when the shit hits the fan your fallback options are zero.

    1. A.P. Veening Silver badge

      Re: Anyone remember 9/11

      Not everyone. The Fark forum stayed up with running commentary from eyewitnesses, and there were also URLs which still showed dynamic content, including live views (yes, I was in Europe and saw the second tower come down in a live stream while most of Europe wasn't even aware the first had come down).

    2. Pascal Monett Silver badge

      Re: Anyone remember 9/11

      Indeed, and I will never stop saying that if your own server falls over, it's only you and your customers that are impacted. You have mitigation options, if you care to put the money on the table.

      When The Cloud(TM) falls over, it's you and millions of other people that are impacted, and the only thing you can do is sit and wait until Someone Else's Server comes back online.

      I do not see that as an advantage for any company that is serious about making money.

      1. jake Silver badge

        Re: Anyone remember 9/11

        "I do not see that as an advantage for any company that is serious about making money."

        Indeed. Today it would seem that companies aren't all that interested in making their investors a long-term profit, rather they are interested in baffling their investors with bullshit buzzwords.

        1. amanfromMars 1 Silver badge

          Re: Anyone remember 9/11

          Indeed. Today it would seem that companies aren't all that interested in making their investors a long-term profit, rather they are interested in baffling their investors with bullshitbuzzwords. ...... jake

          You might like to realise that is the surreal state of markets today, jake. and is what is inventing market floats on the up side, and trying to keep afloat leaden and laden with crushing debt unicorn and zombie companies alike on the down side ...... with the biggest fools in present creation imagining it desirable and sustainable and never likely to flash crash catastrophically on their watch/in their life time ..... and just kicking the can of worms problem on down the road for their children and their childrens' children to suffer dealing with. Some parents those, eh? ...... Abuse doesn't even begin to describe the depravity of the activities perpetrated and perversely justified in practically all cases as being their legacy and for their own future good.

          :-) Just imagine, what are the chances of there being a long term catastrophic global pandemic emergency and the markets rising and being stronger and more valuable/valued than ever before whenever billions of folk are being denied even their basic necessary for living needs. It just wouldn't be accepted in reality, would it. The markets and that reality would be recognised as definitely bound to be rigged and a false economy in the thrall of right shady and shadowy non-government and non-state actors taking everybody else on the planet for useful fools on a useless ride/helter-skelter journey.

          Did y'all not get the earlier memo on the situation revealing the enigmatic conundrum and disease which eats you for its pleasures........ Network 1976 "The World is a business" GOD Speech scene Do you want it/need it in plain text/black and white too? That's not a problem and easily done if you do.

      2. jmch Silver badge

        Re: Anyone remember 9/11

        As a business owner, if I have my own servers it's up to my IT guys to keep it up, and restore it quickly if it falls over. If I'm using cloud, it's up to AWS or whoever's IT guys.

        I understand that good IT guys want to keep control, having confidence that they can have better uptime / easier resolution than with cloud. But not all businesses have good enough IT guys.

        So of course some business owners will think hey, maybe AWS or whoever will do a better job, or a good enough job at a lower cost. And in many cases they would be right.

        1. jake Silver badge

          Re: Anyone remember 9/11

          If it's in house, YOU have total control when it goes TITSUP on Friday afternoon. If it's on AWS, their IT guys might get around to giving a fuck on Monday. Or perhaps Tuesday. Maybe.

  9. six_tymes

    Meh, this time I don't believe their excuse. They got hacked and won't admit it. lol

    1. six_tymes

      a lot of aws workers here I see. "ha ha"

    2. Wayland Bronze badge

      We may not love The Cloud but we worship the man in the sky behind the cloud.

      1. jake Silver badge


        Who is "we", Kemosabe?

  10. Anonymous Coward
    Anonymous Coward


    Using a worker thread per neighbour server - is this '90s programming? Each thread typically reserves a couple of megs of stack space and, my guess, only handles occasional messages. 12 hours to restore the system by adding 'a few' (say three) hundred servers at a time, so about 4,000 servers. An async/event-driven model would likely scale better...
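A sketch of the async alternative the comment above suggests — illustrative Python asyncio, not anything from AWS: thousands of idle peer watchers cost a few hundred bytes each as coroutines, rather than a couple of MiB of reserved stack each as threads (~4,000 threads × ~2 MiB ≈ 8 GiB reserved before any work is done).

```python
import asyncio

async def watch_peer(name):
    # An idle coroutine is a small heap object waiting on the event
    # loop, not an OS thread with ~2 MiB of reserved stack.
    await asyncio.sleep(0)  # stand-in for waiting on a peer's messages
    return name

async def main():
    # 4,000 "peer connections" multiplexed onto a single OS thread.
    watchers = (watch_peer(f"fe-{i}") for i in range(4000))
    return await asyncio.gather(*watchers)

results = asyncio.run(main())
print(f"{len(results)} peer watchers serviced on one OS thread")
```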

    1. Anonymous Coward
      Anonymous Coward

      Re: Async

      Millennial coding.

      1. disk iops

        Re: Async

        that's who they hire by the bucket-load and most of them are H1B at that. Did you honestly think they had actual CS degrees and wouldn't design something so stupid?

        The only way forward is to use key-partitioning (like S3 does) and stop being so damn cheap about refusing to use load-balancers since they have their own in-house design for Pete's sake and don't have to pay Citrix for their NetScalers anymore.

        I don't remember how fast the S3 infra converges to a single-system-image, but S3 has 3 distinct tiers for starters and about 350,000 servers globally that need to eventually register and share 'knowledge' about their peers. If Kinesis is not using the correct/latest 'chatter' protocol to discover its swarm, they are fracking idiots.

  11. Elledan Silver badge
    IT Angle

    Don't just throw more hardware at it

    From the autopsy report it sounds like they A) built a really complicated structure ('shards', 'streams') with countless threads to communicate between nodes instead of a single (or pool of) comms thread(s), and B) committed the cardinal sin of not having error detection and graceful degradation.

    While A) isn't an issue by itself, it made B) a lot worse. The real fix here would be to fix the issues in B), but it doesn't sound like they're doing that. Probably because writing good software and testing it under various scenarios costs time and money.

    In short, they'll very likely be back at this exact same meltdown scenario in a matter of months or years.

