How an Amazon engineer's slip-up started a 20-hour Netflix cock-up

An Amazon engineer hit the wrong button on Christmas Eve, deleting critical data in its load balancers and ultimately knackering vid streaming biz Netflix for 20 hours. The Netflix outage hit customers in the US, Canada and Latin America on 24 December, particularly those using games consoles and mobiles to watch films, while …

COMMENTS

This topic is closed for new posts.
  1. robert_raw

    I am glad I am not that man (or woman)!

    1. LarsG
      Meh

      Go on...

      Go on, blame it on the little guy!

      Now tell us what really went wrong....

  2. Steven 1
    FAIL

    Sounds like a RGE - Resume Generating Event...

    1. Anonymous Coward
      Anonymous Coward

      Did he accidentally hit the 'DO NOT PRESS THIS BUTTON' button under the 'DO NOT PRESS THIS BUTTON' sign?

      I think not.

      1. Anonymous Coward
        Anonymous Coward

        "Did he accidentally hit the 'DO NOT PRESS THIS BUTTON' button under the 'DO NOT PRESS THIS BUTTON' sign?"

        It's surprising how many times someone has hit an inviting "emergency" red button - and been unable to explain why its prohibition had exerted such a fatal attraction.

    2. Zaphod.Beeblebrox
      FAIL

      Alternatively, a CLM - Career Limiting Move.

      1. Anonymous Coward
        Anonymous Coward

        "Alternatively, a CLM - Career Limiting Move"

        According to the Peter Principle it is more likely to generate a promotion.

        1. Destroy All Monsters Silver badge
          Holmes

          No, for that you already need upward-ballooning momentum, a nice suit and a few files on company dirt that you could "forget on the bus".

          1. Field Marshal Von Krakenfart
            Angel

            No personal experience of such an incident, but..

            It was probably the computer operators playing Frisbee with the tape container covers in the computer room and accidentally hitting the tape drive reset button with the cover.

            Not that I have any actual experience of such a thing happening <cough> <cough>, but I have heard that it happens sometimes.

            I'm innocent, I promise...

        2. Fatman

          RE: According to the Peter Principle it is more likely to generate a promotion.

          Only if he was in manglement!!

    3. Fatman
      FAIL

      RE: Sounds like a RGE - Resume Generating Event...

      Preceded by a ETE - Employment Terminating Event.

      The only appropriate icon for this screw-up.

  3. mhoulden
    FAIL

    In various projects I've worked on that involve critical data, we usually have a ban on major changes on Fridays or just before a public holiday, and a change freeze until Christmas is well out of the way. What went wrong at Amazon/Netflix that allowed this to happen?
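A minimal sketch of how the policy described above (no major changes on Fridays, none just before a public holiday, and a freeze around Christmas) might be enforced by a pre-deployment check. The function name `change_allowed`, the freeze window and the holiday dates are all illustrative assumptions, not anything Amazon or Netflix is known to use.

```python
from datetime import date, timedelta

# Hypothetical policy values, chosen only for illustration.
FRIDAY = 4  # date.weekday(): Monday is 0, Friday is 4
FREEZE_WINDOWS = [(date(2012, 12, 17), date(2013, 1, 2))]
PUBLIC_HOLIDAYS = {date(2012, 12, 25), date(2013, 1, 1), date(2013, 4, 1)}


def change_allowed(day: date) -> bool:
    """Return True if a major change may go ahead on `day`."""
    # Blanket freeze windows, e.g. the run-up to Christmas.
    if any(start <= day <= end for start, end in FREEZE_WINDOWS):
        return False
    # No major changes on Fridays.
    if day.weekday() == FRIDAY:
        return False
    # No major changes on the day before a public holiday.
    if day + timedelta(days=1) in PUBLIC_HOLIDAYS:
        return False
    return True
```

Under these assumed dates, `change_allowed(date(2012, 12, 24))` is rejected by the freeze window before the other rules are even consulted. A check like this only helps if the deployment tooling refuses to proceed when it returns False.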

    1. asdf
      FAIL

      easy

      >What went wrong at Amazon/Netflix that allowed this to happen?

      Poor management, which is almost always the case in these incidents. It's much easier to blame some peon, but for mission-critical infrastructure like this, not only should it not have been possible for the peon to do this accidentally, he should not have been able to affect service even if he maliciously tried (yes, I know, a pipe dream in most reactive-only crap corporate environments). If the oftentimes sociopaths in charge are going to take the big salaries, then they should occasionally be responsible for something.

      1. A J Stiles

        Re: easy

        Bad managers blame their workers, just like bad workers blame their tools.

        And yeah, I never make any change on a Friday that can't be reverted using ConnectBot on my mobile phone, on a crowded no. 38 bus (which route passes through a fairly serious 3G blackspot).

        1. Anonymous Coward
          Anonymous Coward

          Re: easy

          Sadly I work in a stupid company where all major things happen on a Sunday morning 6am>10am (with a single engineer - who is normally also the only on call engineer.)

          1. Fatman
            WTF?

            Re: Sadly I work in a stupid company where all major things happen on a Sunday morning

            Are you so sure that is all bad?

            WROK PALCE had to """fix""" a telco related """fire hazard""" involving a shitload of phone lines that were not "plenum rated cable" (according to """fire marshal"""). On a Sunday morning, WROK PALCE is only manned by security, and no one else. So, I have to ask, do you want phone lines going down during the business day, with employees at WROK, or do you want the phone lines going down when most employees are at church???

            Let me see, I will take Sunday morning, any time for this kind of downtime.

            1. Blitterbug
              Unhappy

              Re: Sadly I work in a stupid company where all major things happen on a Sunday morning

              Church?

              Seriously?

              Blimey.

              1. Anonymous Coward
                Anonymous Coward

                Re: Sadly I work in a stupid company where all major things happen on a Sunday morning

                'Murrica.

              2. Fatman
                Pint

                Re: Church? Seriously?

                At least that is what they say!!

                Now, if you think I believe most of them, then, I have this swampland in Florida I could sell you.

                For a few, I seriously doubt that they could assume a stand-up position on a Sunday morning.

                Icon expresses why!!

          2. Anonymous Coward
            Anonymous Coward

            Re: easy

            Lucky bugger, try between 1am and 5am Sunday mornings....

          3. Tom 13

            Re: Sunday morning 6am>10am

            Granted 6pm < 10 pm on Friday would probably be better depending on the business, that's still better than 4am>8am Monday morning.

            Of course the guys I really feel sorry for are the point of sale vendors for fast food joints. My friend's migration schedule is always 3am to done with training before and after.

    2. Anonymous Coward
      Anonymous Coward

      I guess it went like this.

      Netflix top honcho moans to Amazon top honcho about wanting everything to be as fast as possible and super dooper for Christmas and everything seems a bit "slow"

      Head Amazon honcho moans to AWS top honcho going "Why is netflix complaining everything is slow RAR RAR RAR"

      AWS head guy goes to Ops manager "RAR RAR RAR I just had Our Head Honcho moan at me that the system's slow, clean it up!"

      Ops manager sighs goes to team "I know it's bollocks but Bob you need to run the maintenance on the nodes for Netflix coz everyone is moaning"

      Bob wanting to go home and start drinking runs the processes but against the wrong object ID then goes home to the family / to the pub

      That is one possible option. Given Netflix isn't really a mature outfit and AWS will do what they're told, I can imagine that being the situation.

      1. P. Lee

        > Ops manager sighs

        and says, "It's Christmas, we're in the middle of a change freeze and we won't be fixing anything which isn't already causing an outage or likely to cause one before the end of the freeze."

        Amateurs!

    3. Psyx
      Stop

      "What went wrong at Amazon/Netflix that allowed this to happen?"

      They made a techie work on Xmas eve. That was their first mistake. He probably didn't want to be there and wanted to get home.

      The second mistake was being too tight to pay the over-time for TWO guys to watch each other's backs and spot mistakes.

      1. Field Marshal Von Krakenfart
        IT Angle

        "What went wrong at Amazon/Netflix that allowed this to happen?"

        Did Amazon/Netflix outsource/offshore to India à la RBS?????

        Just asking

      2. Van

        24x7 ?

        The poster claiming it was operators playing frisbee is a closer guess. I would expect the data center to be manned by a large team of operators 24 x7. And with a 25% shift allowance + extra holidays, they most certainly would want to be there. Eating Pizza, watching TV, in between housekeeping tasks.

    4. This post has been deleted by its author

    5. P. Lee
      Coat

      > we usually have a ban on major changes on Fridays or just before a public holiday,

      Yep, changes are Tuesdays (to leave Monday for final planning and cleaning up after the weekend and avoiding "Monday-itis") and Thursdays (because no-one wants to work weekends and it's cheaper on overtime payments).

      Would it be rude to point out that torrents are naturally fault tolerant and cheaper than F5's?

  4. Anonymous Coward
    Anonymous Coward

    Huh ?

    I would have thought that by definition, "Elastic Load Balancing" would be an adaptive process. Just deleting the state data would temporarily unbalance things until it "learned" again.

    At least that's how *I* would have implemented it. If I was going to call it "Elastic".

    1. NomNomNom

      Re: Huh ?

      I would have made it so that if you strain it too hard it snaps and the pieces go flying across the room and hurt people. they didnt even invite me to an interview though.

      1. Fatman
        WTF?

        Re: that if you strain it too hard it snaps and the pieces go flying across the room

        WHY did an image of Steve Ballmer sitting in an executive chair, being ricocheted by an overstretched bungee cord and hurled out the window of Microsoft's HQ, suddenly pop up in my mind?????

        1. Blitterbug
          Happy

          Re: WHY, did an image of Steve Ballmer ...

          Possibly for the same reason that in my house our two new all-in-ones running 'Microsoft Window' almost ended up with a mug shot of His Steveness mapped to the ClassicShell start button. We decided against, though the thought still tickles.

          1. Field Marshal Von Krakenfart

            Re: WHY, did an image of Steve Ballmer ...

            A bungee boss????

    2. Tom 13

      Re: Huh ?

      Even AT&T has problems building elasticity that can handle losing a large chunk of their normal bandwidth. The expectation is for random single failures that account for maybe 1% of the load. They get good at dealing with those. But kill 25% instantaneously and the cascade failures start taking down the rest of the system. Sure they stress test it in a VM lab, but for some reason the real world never seems to work that way. And you rarely get real world

      No it shouldn't be that way, but all too often it is.

  5. A J Stiles
    WTF?

    Conspiracy Theory Alert

    Amazon own LoveFilm.

    Cue much chin-scratching .....

  6. NomNomNom

    good. people shouldn't be watching movies on the eve of Jesus's birthday.

    Perhaps the amazon engineer was working for the church

    1. sisk
      Joke

      You forgot your icon (I hope).

    2. Mr Young
      Angel

      Thanks -

      I probably needed to recalibrate my sarcasm detector anyway.

    3. Armando 123

      "good. people shouldn't be watching movies on the eve of Jesus's birthday."

      Right, they should be fighting with loved ones. Or, in the words of Paul Gilmartin, "disfunction rears its yuletide head".

      1. Field Marshal Von Krakenfart

        "good. people shouldn't be watching movies on the eve of Jesus's birthday Dies Natalis Solis Invicti

        Fixed it for you.

        There's also the god of wine, Dionysus, also called Bacchus, also called Iacchus, Born December 25th to a virgin mother; performing miracles such as changing water into wine; died and was resurrected after three days and ascended into heaven.

        If I remember correctly there was also a minor cult in the Roman army who worshipped a dead Roman soldier who was born on December 25th, died/was killed and was resurrected after three days.

    4. Destroy All Monsters Silver badge
      FAIL

      Do you even pagan?

      I sure hope you celebrated The Aramaean One's Birthday in front of Stonehenge!

  7. Paul Hovnanian Silver badge
    Facepalm

    Change processes themselves have to be tested and controlled. That errant maintenance process should have been run against a test environment prior to being used on production sites.

    And then a test suite needs to be included and run to ensure that the test/production sites are still up and running after the change is applied.

    1. John 104
      FAIL

      This is why you don't let Devs into production environments - EVER.

      1. Mr Young
        FAIL

        You've obviously never experienced...

        ...what production can actually do with your *precious* designs have you? It barely fucking met the spec before you guys got yer hands on it etc

  8. Anonymous Coward
    Anonymous Coward

    Wonder if Amazon have been hiring ex RBS employees?

    1. MissingSecurity

      Least they didn't blame it on outsourcing.

      1. Anonymous Coward
        Anonymous Coward

        but isn't using AWS a form of outsourcing by Netflix ?

    2. Fatman

      RE: Wonder if Amazon have been hiring ex RBS employees?

      Damn you!!!!

      Another keyboard fucked up!!!!!!

  9. Rick Giles
    Linux

    Now that I know

    Netflix is using Amazon's cloud I may just have to drop them. This is the kind of shit that is going to happen more and more as these idiots give up control of their data and infrastructure.

    Besides, the fact that Netflix doesn't have a Linux app is probably the main reason I want to drop them.

    1. Mike VandeVelde
      Trollface

      "going to happen more and more"

      At current monthly subscription rates that must have been almost $0.25 worth of service we each lost there, no joke if they keep that up the whole economy will soon grind to a screeching halt! One less option for ignoring friends and family, on Christmas of all days when you all know we need it most!! Think of the children!!!

  10. Anonymous Coward
    Anonymous Coward

    Shoulda gone with Akamai

    'nuff said.

    1. John 104
      Thumb Up

      Re: Shoulda gone with Akamai

      Ya, but Akamai is expensive. :D

      We use it here at work and our management keeps coming to me asking if there is a cheaper alternative. I keep saying, yes, but when was the last time we had an outage because of Akamai? The answer is never.

  11. Anonymous Coward
    Anonymous Coward

    My DVDs were unaffected.

    Actually, the outage even failed to affect my video tapes.

    1. Anonymous Coward
      Anonymous Coward

      Re: My DVDs were unaffected.

      lol

  12. Anonymous Coward
    Anonymous Coward

    I know I'm going to be unpopular but what the heck...

    I'm getting rather fed up of the argument from some that this is all down to the change processes and that heads should not roll. I know my organisation's efficiency would improve immensely if I were allowed to fire some asses now and then rather than just shuffle them off to the side to some role where (I hope) they cannot do any damage. I often wish that management had not downsized HR quite so much so that there were actually some warm bodies who would help me satisfy all the regs for sacking someone so I could use the money to hire someone decent instead...

    1. Hooksie

      Re: I know I'm going to be unpopular but what the heck...

      No wonder you were AC on that comment. You think it should be ok to sack people because managers like you continue to ask them to do things that they aren't trained for, don't have the time to finish, aren't their responsibility and that you already outsourced or downsized the team that was SUPPOSED to do that job. Oh, and on top of that you give them a 2.5% pay 'increase' then blame the market conditions.

      To err is human, to really fuck things up requires a computer, a tired engineer and piss poor management.

    2. Fatman

      Re: efficiency would improve immensely if I were allowed to fire some asses now and then

      Simples, just hire this guy:

      http://disqus.com/JIMTHEBOSS/

      Check out some of his CW posts - perfect manglement material.

    3. asdf

      Re: I know I'm going to be unpopular but what the heck...

      > there were actually some warm bodies who would help me satisfy all the regs for sacking someone

      Wow, definitely not an American in a right to work state then. Right to work someplace else, no questions asked, is what it should be called. It sounds worse than it is though, in that it is generally easier to find a job as there is less risk in hiring someone, but you are lucky if you find a place that treats you as anything but an asset.

  13. Anonymous Coward
    Anonymous Coward

    Making bad assumptions

    Actually I'm the kind of manager that fights tooth and nail to get my team trained, proper pay rises, promotions and fight against outsourcing and downsizing. I have hated it in the past when I have had to make good people redundant. I don't ask any of my team to do anything that I cannot do myself. All of which is why I'll never rise any further. However the propensity of some people to take the piss does make life worse for everyone else. If you know your UK employment law and your employer stints on HR you can be almost unsackable.

    And 2.5%. I'd love to be able to secure that kind of rise for the best people in my team.

    1. Anonymous Coward
      Anonymous Coward

      Re: Making bad assumptions

      " I don't ask any of my team to do anything that I cannot do myself"

      You either are the most talented person in the world, run the least skilled IT department in the world, or are the best bullshitter in the world. Then again, not expecting your staff to do stuff you can't do could explain your own lack of promotion.

  14. fnusnu
    Facepalm

    Looks like Amazon's staff are better than Chaos Monkeys...

  15. DaveNullstein

    Shit happens.

    Always will.

    1. Euripides Pants

      Re: Shit happens.

      Or, in this case, clouds dissipate...

    2. This post has been deleted by its author

  16. pstones578

    Change Control / Change Freeze

    Would it not make sense to have some proper change control? Then Amazon could have reviewed their change documentation and, hey presto, noticed a change had happened around the time of the problem. Also, while they are at it, wouldn't it be a good idea to have a change freeze around such a critical time of year? Unbelievable.

  17. Vince

    Blame the engineer, ignore the cause.

    So the problem is...

    (a) Netflix have poor business continuity planning and rely on a single supplier (AWS) for its systems.

    Cause: Poor management decisions/understanding

    (b) AWS have poor processes that allow a single point of failure

    Cause: Poor management decisions/understanding

    (c) Netflix believed the "cloud" of Amazon would be redundant against anything and assumed they had covered the issues in (a)

    Cause: Poor management decisions/understanding

    The real issue isn't the engineer that "made an error" but the AWS system that can fail despite supposedly being uber-geo-redundant and so on, and the Netflix management who decided to put the eggs in one basket.

    As I understand it, Netflix have local content caches with various ISPs, so I assume the issue was the database/account side and not the underlying content availability. So it would be a *relatively* less expensive task to put a better system in place (I'm not pretending it is trivial, but it's obviously "less tricky" when you haven't got to replicate what I assume is a huge amount of content which would be costly to store/stream en masse).

    A better fix would have been to have multiple providers and the ability for the Netflix client(s)/website(s) to detect a failure and choose (or be forced onto) another.

    Of course this would require more expenditure and at £5.99 (or it seems a penny more if you subscribed more recently) it's unlikely there's enough margin I guess.

    1. Don Jefe
      FAIL

      Re: Blame the engineer, ignore the cause.

      Well, at least we all now know Vince has never been or probably never will be in any sort of management role.

    2. Anonymous Coward
      Anonymous Coward

      Re: Blame the engineer, ignore the cause.

      Nice word that 'assume'

  18. Anonymous Coward
    Anonymous Coward

    Change control

    It really says something about the Amazon change control process. It also says volumes about their support staff; both the person that did the deleting and the subsequent ones that did the troubleshooting. When they encountered missing data, the first thing should have been to look at who made a change, what the change was and what was actually changed. I think Amazon needs to invest in an AAA solution.

  19. Anomalous Cowturd
    Black Helicopters

    AAA solution

    Anti-Aircraft Artillery?

    Make ready my 88mm please, Jeeves.

  20. John H Woods Silver badge

    As with the NatWest disaster ...

    ... it should not be possible for a single engineer to wreak this kind of havoc: systems like this should be resistant even to deliberate malice. Your engineer could be tired, inexperienced or unwell. But they could also be a saboteur working for a competitor, an employee with a grudge, a criminal who is going to hold your system to ransom or even an out-and-out terrorist.
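A minimal sketch of the "no single engineer" idea raised above: a destructive change that cannot run until someone other than its author has signed it off. The `ChangeRequest` class and its method names are hypothetical, invented for illustration; nothing here reflects how AWS actually gates operations.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """A hypothetical destructive operation awaiting sign-off."""
    author: str
    description: str
    approvers: set = field(default_factory=set)

    def approve(self, engineer: str) -> None:
        # The author cannot sign off on their own change.
        if engineer == self.author:
            raise PermissionError("author cannot approve their own change")
        self.approvers.add(engineer)

    def can_execute(self, required: int = 1) -> bool:
        # Runs only once enough distinct second parties have approved.
        return len(self.approvers) >= required
```

Raising `required` above 1 for the most dangerous operations is one way to harden this against a single malicious insider as well as a tired one, at the cost of slower changes.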

  21. Don Jefe

    Preparation

    To prepare for every contingency is usually an excuse to ignore the real world.

    - Me 2003

  22. gcarter
    Facepalm

    /me shakes his head from side to side and adds another handful of movies to his couchpotato / newsgroup queue... can't beat locally stored content :-)

    It's an irony how us web pirates have a more robust solution than the poor souls who choose to go legit ;-)

    1. Fatman

      RE: can't beat locally stored content :-)

      Don't 2TB drives make for some nice amounts of locally stored content!!!!

  23. Sir Codington
    Trollface

    Who does maintenance on Christmas eve? There is always a risk, mostly to one's holiday time.

  24. Andrew Jones 2

    I find it incredibly annoying that Netflix are still claiming it didn't affect people in the UK -

    It bloody well did. But still no explanation why.....

  25. Anonymous Coward
    Anonymous Coward

    That's Netflix fixed.....now for MSFT Media Center?

    Now all we need is for the engineer, PFY or intern who went on Xmas vac forgetting to flick the switch to update the TV Guide data in Media Center to sort out the updates there (we know BDS Ltd have sent data packages to MSFT for upload), then everyone will be happy :-) (For ref, UK data ended Jan 1, so having to use dead tree TV guides and ending up with loads of "Manual Recordings" :-( )

  26. Anonymous Coward
    Anonymous Coward

    maybe, just maybe

    the idiot who didn't check his/her work should shoulder some responsibility for this. perhaps, before initiating a change that could cause a major service outage during a peak usage period, they should take a minute to really look at what they've instructed the system to do before they hit the go button.

    i can't see how this is management's fault: it's just poor workmanship.

    i'm assuming it's all techies who have blamed the managers. well, i am a techie and this is just someone doing a shit job because it's xmas eve and they're not paying attention.
