AWS reveals it broke itself by exceeding OS thread limits, sysadmins weren’t familiar with some workarounds

Amazon Web Services has revealed that adding capacity to an already complex system was the reason its US-EAST-1 region took an unplanned and rather inconvenient break last week. The short version of the story is that the company’s Kinesis service, which is used directly by customers and underpins other parts of AWS’ own …

  1. chuckufarley Silver badge
    Facepalm

    I think they are Nerfing...

    ...the wrong object.

    Single Points of Failure are BAD...unless you have Too Many Points of Failure. In that case the KISS dogma will never run over your karma.

    As a private citizen that would be extremely put out if the entire world were to catch fire I cannot condone giving more control to a daemon that has fallen over so spectacularly. It's almost as if they fed SystemD'oh! steroids and expected good things to come of it.

    1. Anonymous Coward

      Re: I think they are Nerfing...

      "Single Points of Failure are BAD...unless you have Too Many Points of Failure"

      Seems like they had both - they added servers that all needed to communicate with each other via micro-services (redundancy, yay!) but then each OS ran out of threads. The AWS explanation is quite well written and this all shows the complexity of scaling up...

      1. Pascal Monett Silver badge

        Agreed

        As much as I despise The Cloud (TM), I have to admit that the engineers working on it are really pushing the limits.

        Now, the amount of RAM on a server is no longer the problem - it's the amount of threads a CPU can handle that is.

        Wow. Is there anything we can't push to the brink ?

        1. tip pc Silver badge

          Re: Agreed

          "Now, the amount of RAM on a server is no longer the problem - it's the amount of threads a CPU can handle that is."

          It wasn't CPU threads but the number of threads the OS could handle.

          They stated they will amend the limitation in the OS and also use bigger servers instead of more servers.

          But yes still pushing limits of their cloud platform.

        2. Cynic_999

          Re: Agreed

          ISTM that there is surely no inescapable need to have a separate thread for each server on the system. In fact doing so seems to me to be pretty inefficient. So instead of increasing the server capability, it would surely be better to re-write the method of operation so that it does not need to run so many separate threads - perhaps (as an off-the-cuff possibility) by having one thread that polls each server round-robin style rather than having a dedicated thread for each server.

          1. MacroRodent
            Boffin

            Re: Agreed

            There is a well-known and widely used solution to this problem: thread pools at the userland level. Your threading library hands out fake threads that map to a smaller number of OS-supported threads. The fake thread may use a different real thread at different activations. This works because the threads usually sleep most of the time anyway if they have been spawned for communications purposes, and it is easier to dynamically allocate the required data for a large number of threads at userland level, than in the kernel.
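A minimal Python sketch of the pooling idea in the comment above, for anyone who wants to see it in code. It is not how Kinesis is built; the peer names, the `contact_peer` stand-in and the pool size of 8 are all assumptions made up for illustration. The point is only that 1,000 logical "talk to this peer" tasks can share a handful of OS threads.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Hypothetical peer names, purely for illustration.
PEERS = [f"server-{i}" for i in range(1000)]


def contact_peer(name: str) -> str:
    # Stand-in for a short exchange with one peer; a real version would do
    # network I/O here. Returns the name of the OS thread that did the work.
    return threading.current_thread().name


def poll_all(peers, max_workers=8):
    # All tasks are queued onto a pool of at most max_workers OS threads,
    # instead of one dedicated thread per peer.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return set(pool.map(contact_peer, peers))


if __name__ == "__main__":
    used = poll_all(PEERS)
    print(f"{len(PEERS)} peers served by {len(used)} OS threads")
```

Strictly speaking this is a plain worker pool rather than the "fake threads" mapping described above, but the effect on the OS thread count is the same: it is bounded by the pool size, not by the number of peers.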

            1. TimMaher Silver badge
              Thumb Up

              Re: Agreed

              Have to agree with both of you on this point.

              Using a thread per server is an odd design and it can be ameliorated by using thread pools, which have been around for a very long time.

          2. Warm Braw

            Re: Agreed

            This.

            "Does not scale" leads directly to "does not compute" in such scenarios.

            I can't glean a great deal of useful insight from the AWS post, but it does seem that there's a kind of circular dependency: the scaling and provisioning depends on higher-level services that don't work when there are scaling and provisioning issues. Calling up the protocol stack is a risky business, because it inevitably calls right back down.

      2. Anonymous Coward

        Re: I think they are Nerfing...

        Sounds like they had DBA and network engineer bods doing stuff that should have also involved OS engineers. Seems to me they didn't even consider the issue of OS limits and the fact that no matter how much you virtualise stuff and put hardware in load balanced clusters, ultimately it all runs on operating systems and hardware that has physical limitations. Sometimes I get the feeling people who should know better forget this rather important fact.

        But hey, it's Da Cloud! It's all magic and Just Works, right?

        1. jake Silver badge

          Re: I think they are Nerfing...

          "But hey, it's Da Cloud! It's all magic and Just Works, right?"

          That's what Marketing told us, so it must be true!

          1. Clunking Fist

            Re: I think they are Nerfing...

            Of course, otherwise they wouldn't have said it, Shirley?

        2. mikepren

          Re: I think they are Nerfing...

          I think it's worse than that. I think their design is wrong for massive scale. Status messages shouldn't need to be p2p; that's what you have topics and messaging for. In the days of on-prem app servers you used to have state replication like that (p2p), but as you scaled you moved to a different paradigm, like a central HA DB or some broadcast technology.

    2. Crypto Monad Silver badge

      Re: I think they are Nerfing...

      > Single Points of Failure are BAD

      There was no SPOF here, but a complex, highly-coupled system, which also happened to have an N^2 scaling issue (every node talks to every other node).
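A quick back-of-the-envelope in Python for the scaling point above. It assumes one OS thread per peer server, as described in the AWS write-up quoted in this thread; the fleet size of 4,000 is only a guess made by another commenter, not an AWS figure, and the function names are invented for illustration.

```python
def threads_per_server(n: int) -> int:
    # Each server holds one thread for every *other* server in the fleet.
    return max(n - 1, 0)


def total_threads(n: int) -> int:
    # Fleet-wide thread count grows as n * (n - 1), i.e. quadratically.
    return n * threads_per_server(n)


def fits(n: int, os_thread_limit: int, baseline_threads: int = 0) -> bool:
    # True if one server can still hold its peer threads on top of whatever
    # it already runs (baseline_threads is an assumed figure).
    return baseline_threads + threads_per_server(n) <= os_thread_limit


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        print(n, threads_per_server(n), total_threads(n))
```

Doubling the fleet doubles the per-server thread count but roughly quadruples the fleet-wide total, which is why "just add a few more servers" eventually tips every node over at once.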

      1. Doctor Syntax Silver badge

        Re: I think they are Nerfing...

        Like the man said, too many points of failure.

      2. Anonymous Coward

        Re: I think they are Nerfing...

        Absolutely, but it could be argued that the single point of failure is having a single design for each of the nodes, creating a mono-culture (it doesn't affect the reliability at a node level but affects the impact at a system-wide level).

        It also highlights the difficulties with realistic scaling in pre-deployment tests.

  2. amanfromMars 1 Silver badge

    Re: Plan One

    Plan one: use bigger servers.

    Is that plan akin to building bigger aircraft carriers which just also creates a bigger and more vitally important to planned operations target whenever deployed in service of customer requests? Creating a Goliath for a David to vanquish via similar unexpectedly simple means?

    Is not the problem still remaining to be solved one of secure and timely rapid communication across and between and within fleet servers of disruptive additions/problematic events which don't automatically, quite naturally, trigger panic overload conditions/systems meltdowns/command reorganisations?

    The difficulty then to resolve, whenever something is automatically quite naturally triggered, is the answer is not natural and be of artificial and/or alien being/construction/phorm with that realisation further hindering necessary reform and preventing timely human resolution leading to greater outrages in future outages?

    1. chuckufarley Silver badge
      Holmes

      Re: Plan One

      Are you saying that AWS is a symptom of Humanity's Auto-Immunity Disorder?

      1. amanfromMars 1 Silver badge

        Re: Plan One

        Are you saying that AWS is a symptom of Humanity's Auto-Immunity Disorder? .... chuckufarley

        No, although could/would it also be if not a result of Humanity's Auto-Immunity Disorder?

    2. mikepren

      Re: Plan One

      It's their immediate plan. It's going to take time to rearchitect from many to many to something more scalable, like a service mesh.

      1. Anonymous Coward

        Re: Plan One

        Either you didn't read/understand what the problem was, or do not know what a service mesh is.

    3. MJB7

      Re: Plan One

      No. They are planning to shift from "many thousands of servers" to either "a few thousand servers" or "many hundreds of servers" - not "tens of servers".

      1. Roland6 Silver badge

        Re: Plan One

        >No. They are planning to shift from "many thousands of servers" to either "a few thousand servers" or "many hundreds of servers"

        It does seem that Amazon have hit a horizontal scaling limit of a flat server net based on small increment (ie. single server) scaling. It would seem the natural solution would be to introduce some larger scaling unit; a natural unit would be to create a larger vserver that consists of hundreds/thousands of individual servers and adapt their service/server management strategy accordingly.

        Also it would seem that some form of broker service is going to be needed. Both service orchestration patterns have been tried and tested over the decades, just need to be adapted for the cloud.

    4. Wayland

      Re: Plan One

      Throwing better hardware at the problem is a sensible quick fix. However, tuning the software is probably what's needed. The National Grid has similar problems with balancing its load. Excellent in theory but seems to have a mind of its own at the large scale.

  3. jake Silver badge

    From the Redmond school of repair.

    "Turn it off and then back on again."

    Would YOU trust these numpties with your corporate data?

    1. Graham 32

      Re: From the Redmond school of repair.

      How would YOU return a failed system back to a known state?

      1. jake Silver badge

        Re: From the Redmond school of repair.

        What failed system? My system has been running non-stop since January 1st, 1981 ... and the only reason it went down then was because I decided to reboot everything and come up from scratch during the world-wide NCP to TCP/IP transition.

  4. davef1010101010
    Coat

    Translation - They built a "Cloud of Cards" ?

    And the cards got wet!

  5. Greybearded old scrote Silver badge
    Facepalm

    Foot, meet shotgun

    When every server talks to every other server, then exponential combinations are not your friend.

    Why no schadenfreude icon?

    1. Muscleguy

      Re: Foot, meet shotgun

      Even the brain doesn’t do that. Childhood is much more interconnected than in adult brains but links get pruned, or should do. Synaesthesia results from a lack of normal pruning. Kids are naturally synaesthetic to some degree. Also have very low impulse control and are prone to tantrums.

      The stability of adult brains points to a distributed node model. There are orders of magnitude more connections in a brain than in any server farm. There are issues with always on, system malfunctions leading to spurious errors (hallucinations) and even system death result from lack of appropriate downtime.

      Silicon may avoid the downtime issue since it’s due to the need to take out the trash, metabolites being dumped into the cerebro-spinal fluid which then drains into the lymph none of which really applies to silicon.

      Surely the internet architecture of nodes points the way to how to structure large server farms?

  6. Piro Silver badge

    Potential enormous boost for AMD

    There's nowhere else they can get very high core count cpus that still support multiple sockets

    1. doublelayer Silver badge

      Re: Potential enormous boost for AMD

      Only if they find three things: a) they can't get around the one thread per server thing, b) all their threads are putting too much pressure on the CPU, not just the OS's limits, and c) they still can't get around the one thread per server thing. So far, when they increase the CPU power available to the VMs, it's so they can reduce the total number of them rather than to get more concentrated compute. They could solve problem A by using a system that allocates threads to access requests rather than reserving one per server. Having done that, it seems unlikely that they'd experience problem B at all, based on their statements. If they did, they could always try to solve problem C by redesigning the system so it doesn't have quadratic scaling, for example by having certain nodes whose responsibilities are to contact subsets of the servers and keep that data available for servers in other zones. If all of those attempts fail, then AMD may have a cause to celebrate.
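A rough Python sketch of the "certain nodes contact subsets of the servers" idea from the comment above, compared against a full mesh. The cell size of 50, the fleet size of 4,000 and the single aggregator per cell are all assumptions for illustration; nothing here comes from AWS.

```python
import math


def full_mesh_peers(n: int) -> int:
    # Every server talks to every other server.
    return max(n - 1, 0)


def cell_peers(n: int, cell_size: int):
    # Hypothetical layout: servers are grouped into cells, and one aggregator
    # per cell relays state between its members and the other aggregators.
    cells = math.ceil(n / cell_size)
    member = 1                                # a member only talks to its aggregator
    aggregator = (cell_size - 1) + (cells - 1)  # own members + other aggregators
    return member, aggregator


if __name__ == "__main__":
    n = 4000
    print("full mesh, per server:", full_mesh_peers(n))
    print("cells of 50 (member, aggregator):", cell_peers(n, 50))
```

With these made-up numbers the busiest node holds 128 connections instead of 3,999. The cost is an extra hop and a new class of node that has to be kept healthy, so it is a trade rather than a free win.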

  7. rg287

    So they're saying the cloud isn't infinitely scalable?

    Who knew! ¯\_(ツ)_/¯

    1. amanfromMars 1 Silver badge

      Who Knows?

      So they're saying the cloud isn't infinitely scalable? ..... rg287

      Surely it's much more a case of their saying the cloud isn't infinitely scalable safely without the odd occasional problematic security issue?

      1. parperback parper

        Re: Who Knows?

        No matter how big your cloud, N-squared will kill you eventually.

        In this case number of front end server communication links.

    2. TRT Silver badge

      The cloud is infinitely scalable, but you occasionally have to reboot not just the sky, but the entire universe.

  8. xyz Silver badge

    Anyone remember 9/11

    When everyone had to drop back to static pages due to load. Don't suppose anyone on the BBC/AWS team was born then. Anyway, this is what worries me about the cloud, when the shit hits the fan your fallback options are zero.

    1. A.P. Veening Silver badge

      Re: Anyone remember 9/11

      Not everyone, the Fark-forum stayed up with running commentaries from eyewitnesses and there were also URLs which still showed dynamic content including live views on cnn.com (yes, I was in Europe and saw the second tower come down in a live stream while most of Europe wasn't even aware the first had come down).

    2. Pascal Monett Silver badge

      Re: Anyone remember 9/11

      Indeed, and I will never stop saying that if your own server falls over, it's only you and your customers that are impacted. You have mitigation options, if you care to put the money on the table.

      When The Cloud(TM) falls over, it's you and millions of other people that are impacted, and the only thing you can do is sit and wait until Someone Else's Server comes back online.

      I do not see that as an advantage for any company that is serious about making money.

      1. jake Silver badge

        Re: Anyone remember 9/11

        "I do not see that as an advantage for any company that is serious about making money."

        Indeed. Today it would seem that companies aren't all that interested in making their investors a long-term profit, rather they are interested in baffling their investors with bullshit buzzwords.

        1. amanfromMars 1 Silver badge

          Re: Anyone remember 9/11

          Indeed. Today it would seem that companies aren't all that interested in making their investors a long-term profit, rather they are interested in baffling their investors with bullshit buzzwords. ...... jake

          You might like to realise that is the surreal state of markets today, jake. and is what is inventing market floats on the up side, and trying to keep afloat leaden and laden with crushing debt unicorn and zombie companies alike on the down side ...... with the biggest fools in present creation imagining it desirable and sustainable and never likely to flash crash catastrophically on their watch/in their life time ..... and just kicking the can of worms problem on down the road for their children and their childrens' children to suffer dealing with. Some parents those, eh? ...... Abuse doesn't even begin to describe the depravity of the activities perpetrated and perversely justified in practically all cases as being their legacy and for their own future good.

          :-) Just imagine, what are the chances of there being a long term catastrophic global pandemic emergency and the markets rising and being stronger and more valuable/valued than ever before whenever billions of folk are being denied even their basic necessary for living needs. It just wouldn't be accepted in reality, would it. The markets and that reality would be recognised as definitely bound to be rigged and a false economy in the thrall of right shady and shadowy non-government and non-state actors taking everybody else on the planet for useful fools on a useless ride/helter-skelter journey.

          Did y'all not get the earlier memo on the situation revealing the enigmatic conundrum and disease which eats you for its pleasures........ Network 1976 "The World is a business" GOD Speech scene Do you want it/need it in plain text/black and white too? That's not a problem and easily done if you do.

      2. jmch Silver badge

        Re: Anyone remember 9/11

        As a business owner, if I have my own servers it's up to my IT guys to keep it up, and restore it quickly if it falls over. If I'm using cloud, it's up to AWS or whoever's IT guys.

        I understand that good IT guys want to keep control, having confidence that they can have better uptime / easier resolution than with cloud. But not all businesses have good enough IT guys.

        So of course some business owners will think hey, maybe AWS or whoever will do a better job, or good enough job at a lower cost. And in many cases they would be right

        1. jake Silver badge

          Re: Anyone remember 9/11

          If it's in house, YOU have total control when it goes TITSUP on Friday afternoon. If it's on AWS, their IT guys might get around to giving a fuck on Monday. Or perhaps Tuesday. Maybe.

  9. six_tymes

    meh, this time I don't believe their excuse. they got hacked and won't admit it. lol

    1. six_tymes

      a lot of aws workers here I see. "ha ha"

    2. Wayland

      We may not love The Cloud but we worship the man in the sky behind the cloud.

      1. jake Silver badge

        We?

        Who is "we", Kemosabe?

  10. Anonymous Coward

    Async

    Using a worker thread per neighbour server - is this '90s programming? Each thread typically reserving a couple of megs of stack space, my guess only getting occasional messages. 12 hours to restore the system by adding 'a few', say 3, hundred servers at a time, so about 4,000 servers. An async/event driven model would likely scale better...
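For comparison, a minimal Python sketch of the async/event-driven alternative suggested above. Everything in it is an assumption for illustration: `talk_to_peer` just sleeps in place of real network I/O, and the peer count is arbitrary. What it shows is that thousands of mostly idle conversations can be multiplexed onto one OS thread.

```python
import asyncio
import threading


async def talk_to_peer(peer_id: int) -> int:
    # Stand-in for waiting on an occasional message from one peer.
    await asyncio.sleep(0.01)
    # Report which OS thread handled this peer.
    return threading.get_ident()


async def poll_fleet(n_peers: int) -> set:
    # One coroutine per peer, all scheduled on the same event loop.
    results = await asyncio.gather(*(talk_to_peer(i) for i in range(n_peers)))
    return set(results)


def run(n_peers: int = 4000) -> set:
    return asyncio.run(poll_fleet(n_peers))


if __name__ == "__main__":
    threads_used = run()
    print(f"4000 peers handled on {len(threads_used)} OS thread(s)")
```

Each coroutine costs a small heap object rather than a reserved thread stack, so the per-peer overhead is far lower than the couple of megabytes per thread guessed at above.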

    1. Anonymous Coward

      Re: Async

      Millennial coding.

      1. disk iops

        Re: Async

        that's who they hire by the bucket-load and most of them are H1B at that. Did you honestly think they had actual CS degrees and wouldn't design something so stupid?

        The only way forward is to use key-partitioning (like S3 does) and stop being so damn cheap about refusing to use load-balancers since they have their own in-house design for Pete's sake and don't have to pay Citrix for their NetScalers anymore.

        I don't remember how fast the S3 infra converges to a single-system-image, but S3 has 3 distinct tiers for starters and about 350,000 servers globally that need to eventually register and share 'knowledge' about their peers. If Kinesis is not using the correct/latest 'chatter' protocol to discover its swarm, they are fracking idiots.

  11. Elledan
    IT Angle

    Don't just throw more hardware at it

    From the autopsy report it sounds like they A) built a really complicated structure ('shards', 'streams') with countless threads to communicate between nodes instead of a single (or pool of) comms thread(s), and B) committed the cardinal sin of not having error detection and graceful degradation.

    While A) isn't an issue by itself, it made B) a lot worse. The real fix here would be to fix the issues in B), but it doesn't sound like they're doing that. Probably because writing good software and testing it under various scenarios costs time and money.

    In short, they'll very likely be back at this exact same meltdown scenario in a matter of months or years.

    1. fajensen
      Pint

      Re: Don't just throw more hardware at it

      A) isn't an issue by itself

      I think it is *exactly* the issue: AWS making their services insanely complicated may help sell some "Amazon Cloud University"-tickets à la Cisco, but it also backfires because now your own staff needs to be trained for a significant part of the time they spend at their desks and when something blows up, it is not readily apparent what precisely has blown or how to fix it.

      Trying to limit AWS access rights for an AWS Lambda service so that Russian Hax0rs and internet randos cannot blow up one's credit card or the service is absolutely not a trivial exercise. It is not surprising that people leave AWS services open all of the time; the working examples in the AWS docs are all of the "chmod 777"-kind and it is head-explodingly difficult to get to a different state.

      1. disk iops

        Re: Don't just throw more hardware at it

        because the people who write the code are your typical programmer who doesn't understand permissions or access rights - ie. has ZERO hands-on systems admin experience. So of course, 777 is the answer because that's how it has to be on their local laptop.

        1. Anonymous Coward

          Re: Don't just throw more hardware at it

          They are admin on their Windows laptop and run the process as their admin user, so chmod 777 makes zero sense to them. Luckily all developers are now operations DevOps bros and all the admins have been fired, because the developer can use an abstracted Terraform program to do chmod 777 in his container and no one will ever know, especially not the developer.

  12. Mike 125

    same old

    1: Not enough threads?

    2: Configure more threads

    3: backto 1

    ----------------------

    1: Too many cars?

    2: Build more roads

    3: backto 1

    -----------------------

    1: Not enough cheap meat?

    2: Burn the Amazon for more cows (see what I did there)

    3: backto 1

    1. TRT Silver badge

      Re: same old

      I particularly like the third one after they roasted Bezos in Spitting Image:10, which has to be one of the best skits ever.

  13. Martin hepworth

    Open on failure

    Would love to see ANY other cloud provider be this open this quick....

    Stuff happens; it's how you handle it that matters. Come on Azure/GCP, we're looking at you.

  14. Anonymous Coward

    Presenting the "Well-Architected Framework"

    Proprietary many-to-many connections for edge cluster data sharing? Looks like they're not using their own cache product pills. Now we know.

    1. Muscleguy

      Re: Presenting the "Well-Architected Framework"

      Maybe with their insider knowledge they know the offering is pants?

  15. andy 103
    Mushroom

    Isn't the point of cloud infrastructure that you can do more with less?

    That's funny because most cloud providers including AWS are telling everyone the main point of their existence (as far as customers are concerned) is that you can do everything using fewer resources, therefore costing you less.

    Except when it goes wrong. Then you need more resources to fix the problem.

    Almost as if you hadn't bothered with any of it, everything would have been better.

  16. Martin
    FAIL

    And this is going to happen over and over. The old guys, like me, who worked with resource-limited systems and invented loads of neat tricks (which now should be standard patterns, in fact) are now getting old or retired. The young kids, straight out of college, don't realize that just because you've got scads of memory and CPU, doesn't mean you still can't run out of resources - and they implement something like this.

    No error checking? No warning that the number of threads is getting too high? No comms thread pool, as someone else suggested? This was a system where someone just added a bit more capacity and suddenly the whole thing falls over?

    Well, give them their due for admitting to it. But still, this is actually dreadful. Whoever let that design go live should be strung up.
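The kind of thread-count warning asked for above is not hard to sketch. This Python version is only an illustration: the `ThreadBudget` name, the 80% warning ratio and the three-state result are all invented here, and a real service would read its hard limit from the OS or its configuration rather than take it as a constructor argument.

```python
import threading


class ThreadBudget:
    """Hypothetical guard that complains before the thread limit is hit."""

    def __init__(self, hard_limit: int, warn_ratio: float = 0.8):
        self.hard_limit = hard_limit
        self.warn_ratio = warn_ratio

    def check(self, current=None) -> str:
        # Default to the number of live threads in this Python process.
        if current is None:
            current = threading.active_count()
        if current >= self.hard_limit:
            return "refuse"   # do not spawn another thread
        if current >= self.warn_ratio * self.hard_limit:
            return "warn"     # raise an alarm while there is still headroom
        return "ok"


if __name__ == "__main__":
    budget = ThreadBudget(hard_limit=100)
    for count in (50, 80, 100):
        print(count, budget.check(count))
```

Checking this before adding capacity, rather than after threads start failing to spawn, is the difference between an alert and an outage.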

    1. TeeCee Gold badge
      Facepalm

      I agree:

      ...to do so create new threads for each of the other servers...

      That should have raised a red flag in the bloody design phase for something that was always intended to scale across a metric fucktonne of servers.

  17. Claptrap314 Silver badge
    Facepalm

    n^2? Are you #*$&ing kidding me?

    Having learned SRE by supporting Hangouts at Google, I know exactly why they have the front end servers all talking to each other. It does not matter. IT. CANNOT. SCALE. Software engineers completely rearchitect systems rather than implement n^2. If we cannot figure out how to do it ourselves, we ask for help. If that means bringing in a computer scientist, that's what we do. I'm not saying I've never delivered n^2 into prod. I'm saying I never delivered it into prod where scale would ever be a concern.

    The worst part about it is this: "To speed restart, in parallel with our investigation, we began adding a configuration to the front-end servers to obtain data directly from the authoritative metadata store rather than from front-end server neighbors during the bootstrap process." In other words, Amazon switched to an n log n solution in order to dig themselves out of the hole, but then immediately went back to the old way. Brilliant.

    Sure, cells will help. Sure, expanding the number of threads the OS supports will help. Now, where do you put the hard limit on the number of servers in the cell based on the number of OS threads so that no human overrides or changes it? No. Bad architect. No more colored markers. Fix the n^2 problem.

    But, as is often the case, this jape is not over. Their systems, nearly overloading from processing new configuration information, were trying to handle customer traffic--and overloading completely. Let me let you in on a secret: "My job is to keep the network up. It's merely out of my good graces that I allow customers on at all." If your servers are returning 100% 500s while keeping themselves and the network healthy, that can be recovered quickly. If they fall over, or the network does, that is BAD. Really, really bad. Design your systems to drop 100% of traffic before they fall over.

    Moving the critical services to dedicated servers is part of how you do that--that's a good move. But only part of it. As Google recently found out, you need a strict hierarchy for traffic. Configuration changes over everything. Critical logs next--but even here, you require rollups & squelching. Server fleet health traffic next. Then you get into general status & servicing the customer.

    The other issue, and Amazon appears to be feeling its way in this direction, is that you MUST have a clear understanding of your dependencies, and systems in place to handle the failures of these dependencies. Throwing capacity at a problem does not make it go away--it makes the eventual failure that much more spectacular.
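A toy Python sketch of the traffic hierarchy and "shed before you fall over" rule described in this comment. The class names follow the ordering given above (configuration, critical logs, fleet health, customer traffic); the `admit` function, the capacity figure and the tuple format of a request are assumptions made up for illustration, not anyone's production design.

```python
from enum import IntEnum


class Traffic(IntEnum):
    # Lower value = higher priority, following the ordering in the comment.
    CONFIG = 0
    CRITICAL_LOGS = 1
    FLEET_HEALTH = 2
    CUSTOMER = 3


def admit(requests, capacity):
    """Split (Traffic, payload) pairs into (admitted, shed).

    The sort is stable, so requests of equal priority keep arrival order.
    Anything beyond capacity is dropped outright, lowest priority first.
    """
    ordered = sorted(requests, key=lambda r: r[0])
    return ordered[:capacity], ordered[capacity:]


if __name__ == "__main__":
    incoming = [
        (Traffic.CUSTOMER, "GET /stream"),
        (Traffic.CONFIG, "new shard map"),
        (Traffic.FLEET_HEALTH, "heartbeat"),
    ]
    admitted, shed = admit(incoming, capacity=2)
    print("admitted:", admitted)
    print("shed:", shed)
```

With capacity for two requests, the customer request is the one that gets a fast failure, while the configuration change and the heartbeat go through, which keeps the fleet recoverable.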

    1. disk iops

      Re: n^2? Are you #*$&ing kidding me?

      The S3 Index tier is a close analog. When we had to mass-bootstrap the tier the nodes fetched their config from a 'static' source before they fell back to 'chatter/gab' mode to converge. I can't remember how we partitioned node sets but the respective 'master's eventually got their immediate peers all registered and sent updates upstream to other cells and eventually every cell got wind of all the other cells. But we sure as HELL did not maintain N^2 active connections!!

      This was SOLVED 10 years ago by the S3 team (and probably the EBS team). Kinesis apparently didn't bother to avail themselves of the existing codebase.

      This is not an uncommon occurrence at AWS - the teams don't talk and apparently Jassy and his minions haven't beaten the individual service teams with the "REUSE THE GOD DAMN CODE!" hammer enough.

  18. IGnatius T Foobar !

    Should be obvious...

    AWS is too big, too dominant, and is AS A WHOLE a single point of failure. Use other cloud providers.

  19. bazza Silver badge

    Profligacy?

    One thread per server sounds, well, excessive...

    1. Imhotep

      Re: Profligacy?

      But it's just one thread. Should be fine.

      *Well, except for that per server part, and everything going down.
