Slack fingers AWS auto-scaling failure in January outage postmortem

Slack says it has identified a scaling failure in its AWS Transit Gateways (TGWs) as the reason for the chat service's monumental outage on 4 January. As a result, Amazon's cloud computing arm said it is "reviewing the TGW scaling algorithms". The comms platform's outage came at a bad time, just when people were getting back …

  1. Pascal Monett Silver badge

    "it is a matter of balancing different risks and finding the best compromise"

    We're still at the stage where we are finding out what works. Slack's downtime is proof that not all the kinks are ironed out yet, and that engineers haven't thought of everything.

    At the very least, I expect a complete log review and meetings to amend the processes that are in place.

    We'll see what happens next year.

    1. sgp

      Re: "it is a matter of balancing different risks and finding the best compromise"

      Lucky they don't run a nuclear power plant.

  2. Claptrap314 Silver badge

    To be clear

    And with apologies to Howard Taylor...

    The most important traffic is when your engineers are fixing things.

    That means that notices to your engineers that there is a problem are the second priority.

    That puts customer traffic down to priority three, maybe five.

    If you are experiencing significant packet loss on your back end, your quota system has failed you horribly. Drop that customer traffic. Drop it now. They will retry in a bit. If your back end falls over, it might not come back for quite a while.
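
    (A minimal sketch of that priority-ordered shedding rule, just to make the ordering concrete. The tier names, the packet-loss threshold, and the health signal are my own illustrative assumptions, not anything taken from Slack's or AWS's postmortem.)

        from enum import IntEnum

        class Priority(IntEnum):
            ENGINEER_FIX = 1    # engineers repairing the outage: never shed
            ENGINEER_ALERT = 2  # monitoring/paging traffic to engineers
            CUSTOMER = 3        # ordinary customer requests: first to be dropped

        # Hypothetical threshold: once back-end packet loss exceeds this,
        # shed the lower-priority tiers instead of letting the back end fall over.
        SHED_LOSS_THRESHOLD = 0.05

        def should_accept(priority: Priority, backend_packet_loss: float) -> bool:
            """Accept a request only if the back end is healthy enough for its tier."""
            if backend_packet_loss < SHED_LOSS_THRESHOLD:
                return True  # healthy: accept everything
            # Degraded: keep only engineer traffic; customers retry in a bit.
            return priority <= Priority.ENGINEER_ALERT

        if __name__ == "__main__":
            loss = 0.12  # simulated packet-loss rate on the back end
            for p in Priority:
                print(p.name, "accepted" if should_accept(p, loss) else "shed")
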
