Amazon S3-izure cause: Half the web vanished because an AWS bod fat-fingered a command

Amazon has provided the postmortem for Tuesday's AWS S3 meltdown, shedding light on what caused one of its largest cloud facilities to bring a chunk of the web down. In a note today to customers, the tech giant said the storage system was knocked offline by a staffer trying to address a problem with its billing system. …

  1. Anonymous Coward
    Anonymous Coward

    So much for fault injection testing !

    " Hey Ravi - run this CLI ... that is what fixed it last time ... "

    1. Mpeler
      Mushroom

      Re: So much for fault injection testing !

      Here's a song for them then:

      I've looked at clouds from both sides now

      From up and down, and still somehow

      It's cloud illusions I recall

      I really don't know clouds at all.....

    2. The IT Ghost

      Re: So much for fault injection testing !

      Plenty of fault was injected, no doubt. Probably 4 or 5 people shown the door, none of them the one who actually flubbed the command.

    3. TheVogon

      Re: So much for fault injection testing !

      "an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process," the team wrote in its message.

      "Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended."

      Wow, they make manual command-line changes that can impact lots of production systems?! Glad I don't use Amazon then. Such changes should be planned, change-controlled, scripted in a file, and four-eyes reviewed before pressing go....

  2. Herby

    To err is human...

    ...to really foul things up requires a computer.

    To guarantee a mess put a human in charge of said computer. Enough said.

    Fat fingers win every time as in "I only changed one card line"...

    I'm showing my age...

    1. TitterYeNot

      Re: To err is human...

      "To guarantee a mess put a human in charge of said computer. Enough said."

      And to guarantee a shitstorm of Diluvian proportions, put said barely-technical human in front of some automation they don't really understand - but hey it looked great in the management meeting.

      It'll save us loads of money, they said.

      It'll guarantee five nines availability, they said.

      It's foolproof, they said...

    2. Anonymous Coward
      Anonymous Coward

      Re: To err is human...

      ""I only changed one card line"..."

      What did you change?

      Nothing.

      What did you change?

      Nothing .....that is relevant.

    3. SotarrTheWizard

      Re: To err is human...

      Funny that you mention punchcards: I recently pulled one of my old boxes of code stacks out of the cellar, to let my grand-daughters make the quintessential early-70s craft project: the Punchcard Christmas Wreath.

      I had forgotten the joys of card stacks, and the multiple marker and highlighter lines across the top of the deck to help quickly restore the deck if you dropped it.

      Good times, good times. . .

      1. Anonymous Coward
        Anonymous Coward

        Re: To err is human...Punch cards divine

        The 1970s with their punch cards were good times, a peak in many ways for Canadians, and I'm not talking Fortran WATFOR or WATFIV.

        Back then the average family income was about $10,000. That's about $65,000 today, which, if you look up family income, is still roughly the middle of family incomes today. No real growth, but apparently not much of a setback, until we look at where that income comes from and where it goes.

        In the 1970s family income was usually a single income. Today almost all $65K families are at least dual income, and thanks to dramatic changes in Canadian taxes - in who pays and how much is collected - they do not get to keep much of that. Even the US numbers show us what good times the past held when it came to growth and optimism.

        "Expressed in 1950 dollars, U.S. median household income in 1950 was $4,237. Expenditures came to $3,808. Savings came to $429, or 10 per cent of income. The average new-house price was roughly $7,500 – or less than 200 per cent of income. By 1975, however, it took 300 per cent of median household incomes to buy a house; by 2005, 470 per cent."

        Many more years in school and training are required to get a job, all adults in a family have to work, most at jobs with much longer hours and often no benefits, and today it is almost impossible to get a detached house in a major Canadian city for even 10x the annual income of the average high-school graduate.

        When I look fondly at punch cards I am reminded that the good times were largely the result of citizens being "allowed" to share in the wealth they were creating.

      2. Anonymous Coward
        Anonymous Coward

        Re: To err is human...

        As late as 2001 we used blank punch cards at IBM as note pads / post-it notes... the file cabinets were stacked with them instead of note pads.

    4. fidodogbreath

      Re: To err is human...

      In the original Reg article about the S-pocalypse, I commented that the last voice command ever was "Alexa, turn off all the servers." Turns out, that's more or less what happened.

      Since the outage took down IFTTT, "Alexa, turn all the servers back on" didn't work.

    5. ZootCadillac

      Re: To err is human...

      Herby, don't misplace the punch tape!

  3. Anonymous Coward
    Anonymous Coward

    Homo Sapien Ergonomics

    I wish my finger tips were smaller than the average keyboard key.

    Otherwise, I'm quite proud of my Neanderthal heritage.

    1. MyffyW Silver badge

      Re: Homo Sapien Ergonomics

      I'm quite proud of my amply covered form, but plump fingers are a bloody nuisance.

      1. Anonymous Coward
        Anonymous Coward

        Re: Homo Sapien Ergonomics

        plump fingers are a bloody nuisance.

        Yes, but for most of the bigger boned, that's down to choices they've made (eg, whilst passing Greggs). It is also one that they can unmake, if the downsides of podgy digits get too much?

    2. gotes

      Re: Homo Sapien Ergonomics

      I wish the enter key wasn't so close to the backspace key.

  4. John Smith 19 Gold badge
    FAIL

    Makes me wonder how many others in the "playbook" have this capacity.

    Well it should be making Amazon wonder that.

    Under what circumstances would you want to be able to (virtually) shut down a whole data centre with one (mis) executed command?

    1. Anonymous Coward
      Anonymous Coward

      Re: Makes me wonder how many others in the "playbook" have this capacity.

      They dig into this to an extent in the full statement. The command alone wasn't enough to do it. It was running a command designed for a much smaller scale of S3 over too many machines, causing a bunch of systems subsequently layered over those machines to mutually screw each other up.

      Critically, it was the requirement to restart that really screwed them. The system hadn't been restarted in so long that no one had noticed the restart procedure took a really, really long time. Cheeky little humblebrag, methinks.

      They also mention a full audit of existing operations to ensure sanity checks are in place. I for one look forward to the outage caused by being unable to effect a change to as many machines as actually needed, because sod's law's just like that.

      1. Bronek Kozicki

        Re: Makes me wonder how many others in the "playbook" have this capacity.

        I think they need a "chaos monkey" to occasionally reset some machine or shut down some process. At random. That would force them to learn to build inherently resilient systems, quickly.
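
        A minimal sketch of that sort of thing, purely illustrative - it assumes a hypothetical hosts.txt listing candidate machines and a service that is safe to bounce:

        #!/usr/bin/env bash
        # chaos-lite.sh - pick one random host from a list and restart one service on it.
        # hosts.txt, the ssh access and the service name are all assumptions for illustration.
        set -euo pipefail

        victim=$(shuf -n 1 hosts.txt)   # pick a machine at random
        echo "Restarting example-service on $victim"
        ssh "$victim" 'sudo systemctl restart example-service'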

      2. John Smith 19 Gold badge
        Unhappy

        "They also mention a full audit of existing operations to ensure sanity checks are in place. I"

        Oh dear, that sounds like an event.

        Not a process.

        Which suggests they will find (and hopefully fix) all such issues this time round, while a whole new bunch accumulate over time till the next one surfaces and borks them again.

        Periodic review following significant (cumulative) changes should be SOP for such a large operation.

    2. Anonymous Coward
      Anonymous Coward

      Re: Makes me wonder how many others in the "playbook" have this capacity.

      "Under what circumstances would you want to be able to (virtually) shut down a whole data centre with one (mis) executed command?"

      Ultimately somebody has to have the power to do this because shutting down servers is a valid admin activity. However, it should be made a multistep process with plenty of Are You Sure? type prompts (or even somehow require 2 people/keys, nuclear-missile-launch style), not something that can be done with a single mistyped command. In the end it's a balancing act: treat your admins like responsible professionals rather than children who need to be hand-held, but also ensure one tired person can't make an almighty cock up.

      1. Keith Langmead

        Re: Makes me wonder how many others in the "playbook" have this capacity.

        "However it should be made a multistep process with plenty of Are You Sure? types prompts"

        Not just "are you sure Y/N", but also "Here's exactly what is about to be done... is that correct and what you actually intended? Y/N", otherwise anyone would just assume the command they'd entered would do what THEY intended, not what the command was about to do.

        1. Bronek Kozicki

          Re: Makes me wonder how many others in the "playbook" have this capacity.

          Not Y/N, but "in the prompt below, enter the missing word from the above shell command, to make it work". Force them to read and think, that is.

          1. donk1
            FAIL

            Re: Makes me wonder how many others in the "playbook" have this capacity.

            1st prompt

            This will shutdown 1040 servers, please type 1040 to continue.

            2nd prompt

            This will reduce capacity enough to cause a service failure for the following 8 services

            A

            ...

            G

            Please type "8 SERVICE FAILURES" to continue.

        2. Allan George Dyer

          Re: Makes me wonder how many others in the "playbook" have this capacity.

          "However it should be made a multistep process with plenty of Are You Sure? types prompts"

          So HAL was just working to design?

          "I think you know what the problem is, Dave"

      2. John Smith 19 Gold badge
        Unhappy

        "not something that can be done with a single mistyped command. "

        My point exactly.

        Yes servers have to be taken down. Yes sometimes clusters of servers have to be taken down. But it should be very rare that all need to be taken down at the same time.

        And it should be impossible to do so without whoever's doing it realizing exactly what is about to happen.

      3. Adam 1

        Re: Makes me wonder how many others in the "playbook" have this capacity.

        > Ultimately somebody has to have the power to do this because shutting down servers is a valid admin activity. However it should be made a multistep process with plenty of Are You Sure? types prompts

        How about "Please enter the shutdown validation GUID. This can be found on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard."

    3. Wayland

      Re: Makes me wonder how many others in the "playbook" have this capacity.

      One command is better than having to type 100. 100 commands put into a file - we call that a 'program'.

  5. gv

    PEBKAC

    That is all.

  6. Anonymous Coward
    Anonymous Coward

    Next problem:

    "I'm sorry Dave, I can't let you do that"

  7. Your alien overlord - fear me

    I want to know what command they were supposed to enter and what they actually entered.

    1. Anonymous Coward
      Anonymous Coward

      It's a super awesome convenience to be able to hit tons of machines in a big data center operation, but as you can see, things can go wrong in a big way. It would be interesting to see a pseudo-syntax of what happened - whether this was a web GUI or a CLI, or a script, what have you.

      I can tell you that at the Yahoo! CEO shuffle I attended a few years back we could address wide swaths of machines, but most of the folks knew what not to do, and how to break up big jobs (ha!) into easy-to-handle tasks. For instance, my first task was to run a script that fixed the storage issue with NetApp "moosehead" disks that would cause them to lose data and - the extra cool thing - not be able to recover from their RAID! Good times! This was on over 300 mail "farms", which were middle-tier mail-handling clusters that did the sorting of mail vs junk/spam. The spam goes off to cheapo storage, and "good mail" goes to the main stores. Anyway, the IDs needed fixing to point users' mail to the new storage by running a script on close to 6000 machines - no VMs, all pizza boxes. No WAY was I just going to go nuts and try to run them all at once, even though you could very well do that with Limo, their internal custom multi-host command tool, later replaced by a tool called Pogo. Clusters of machines could also be addressed with aliases, so I could say "all hosts in a group with a simple name": turn off the flag to show availability to the VIP.

      For the script work I was clued in via change management meetings, then I ran the script on one farm to make sure it worked and that we did not clobber any users, then we did 10 farms, then 100, and the rest (are here on Gilligan's Island!). No problem. My goal was to not cause any issue that would make it into the news. :P I had nothing to do with the security side either, which is a big embarrassment to their new owners, I'm sure.

      I was also in Search (AKA the Bing Gateway), and there we typically chose UTC midnight on Wednesdays to perform updates to the front-end search servers. In the US there were two big data centers, each with two clusters of 110 hosts to handle the web-facing search front end. For maintenance, you just chose a single host, took it out of the global load balancer, updated it, and dropped it back in with extra monitoring turned up. If it didn't crap itself, we could then take out half of a data center, do the update, put them back in, then repeat the process three more times for the other clusters, and that was that. But yes, it's super easy to fuck up and take out every data center if you don't pay attention to your machine lists.

      1. Anonymous Coward
        Thumb Up

        It's a super awesome convenience...

        You could take down Bing or Yahoo! any time you like and for as long as you like for "maintenance" and pretty much no-one would ever notice. In fact, why not just leave them down and free up some server space?

        1. fredesmite
          Meh

          Re: It's a super awesome convenience...

          Quite honestly - if Bing, FB, Google, Yahoo, blah blah disappeared, would they really be missed?

          They produce nothing other than hordes of advertising spam. Remember the days before that crap existed... young adults could actually have a face-to-face conversation, and working meant doing something other than browsing the internet for links to share among co-workers...

      2. donk1

        6000 machines... so run 200 machines at a time, 30 times.

        What is this obsession with 10, 100, 2000, rest, and doing a massive population in 5 steps?

        Even if 2110 machines worked fine, how long would it take to fix the last 3900 machines if enough of them broke?

        For failures it is not the number of times you have done it before but the size of the failure domain and how long it takes to fix.

        It should be possible to roll out automatically in small batches, and even have multiple upgrades rolling out at the same time on an automatic schedule - ripple across the farm!

        If it is automated and scheduled who cares how many batches of upgrades are run?

        You would catch errors with less impact that way as the failed batch size would be smaller and it would be minimal extra work if designed correctly.

        This is the next stage in cloud service design - being able to have slower rolling upgrades with smaller batches!
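
        A minimal sketch of that kind of small-batch rollout, assuming hypothetical hosts.txt, run-upgrade.sh and health-check.sh helpers (none of which are from the article):

        #!/usr/bin/env bash
        # rollout.sh - apply an upgrade in fixed-size batches and stop on the first unhealthy batch.
        set -euo pipefail
        BATCH=200

        mapfile -t hosts < hosts.txt
        for ((i = 0; i < ${#hosts[@]}; i += BATCH)); do
            batch=("${hosts[@]:i:BATCH}")
            echo "Upgrading batch starting at host $((i + 1)) (${#batch[@]} machines)"
            for h in "${batch[@]}"; do
                ssh "$h" './run-upgrade.sh'
            done
            for h in "${batch[@]}"; do
                # any failed health check stops the whole rollout, keeping the failure domain small
                ssh "$h" './health-check.sh' || { echo "Batch failed at $h - stopping rollout" >&2; exit 1; }
            done
        done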

    2. fronty

      rm -rf /

      1. Kevin McMurtrie Silver badge

        Funny, this should have finished while I was at lunch

        $ cd storage

        $ rm -rf tmp1* tmp2* tmp3 *

        1. muddysteve

          Re: Funny, this should have finished while I was at lunch

          >$ cd storage

          >

          >$ rm -rf tmp1* tmp2* tmp3 *

          That's always been the trouble with computers - they do what you tell them to, rather than what you wanted them to.

        2. Doctor_Wibble
          Boffin

          Re: Funny, this should have finished while I was at lunch

          When it comes to spotting mistakes, the first guess is probably the correct one - and having had numerous requests for file recovery over the years, the 'extra space' problem is not that rare.

          Perhaps oddly, it seemed to be more common amongst people who did know what they were doing but didn't stop to re-inspect what they typed to see if they'd accidentally batted the space bar somewhere.

          Though at the other end of the scale, someone trying to follow unfamiliar instructions printed in a poorly-selected font where they have been told 'do this exactly' and it sure as hell looks like that's meant to be a space there...

        3. Colin Bull 1

          Re: Funny, this should have finished while I was at lunch

          It is very easy to set an alias for rm so that it lists all directories it is going to delete and asks you for confirmation first - simples

          1. Anonymous Coward
            Anonymous Coward

            Re: Funny, this should have finished while I was at lunch

            or just use " rm -i"?

        4. stu 4

          Re: Funny, this should have finished while I was at lunch

          I did a similar thing about 2 months ago on my Mac while trying to tidy stuff up in the root drive.

          UserTemp

          Usertemp

          ...

          sudo rm -rf User*

          hmm... that's taking an awfully long time to delete some temporary crap....

          ..argh!@!!@#!@^#^

          CntlC CntlC CnltC

          Luckily good old timemachine got me back to an hour before and I had a 'Users' directory again.

          I have to say, in 10 years of mac ownership... one of the many many many times timemachine has got me out of a deep deep hole.

          I also remember one time, about 20 years ago - working for a large UK telecom company...needed to reboot one of the live boxes that handled 30% of the load of UK non geographic phone calls (0845, 0800, etc)...

          sudo shutdown now -r

          ....

          ...

          hmm can't seem to connect to that... doesn't seem to be coming back up..

          It was in an unmanned exchange 30 miles from the nearest engineer... had to get one of 'em to go out there and press the ON button again.

    3. roselan

      rm -rf //

    4. TomChaton
      Alert

      re: ...and what did they actually enter.

      I suspect it had an asterisk in it somewhere.

  8. Dwarf

    This command will affect 13,432,454,456,234 objects. Are you sure?

    Of course I'm sure - it's pre-programmed that I hit Yes whenever any pop-up or confirmation is shown.

    1. Anonymous Coward
      Anonymous Coward

      We used to ask for double confirmation on important decisions like abandoning things.

      We soon learned that the second prompt had to have an inverse question - so a second "yes" was effectively a "no". That blocked trigger happy responses and made people stop and think.

      1. Anonymous Coward
        Anonymous Coward

        We soon learned that the second prompt had to have an inverse question - so a second "yes" was effectively a "no". That blocked trigger happy responses and made people stop and think.

        Works even better if you present the two dialogs in a random order...
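
        A minimal sketch of the inverted double-prompt, purely illustrative:

        #!/usr/bin/env bash
        # confirm-twice.sh - the second question is inverted, so a reflexive second "yes" aborts.
        set -euo pipefail

        read -r -p "Abandon the current job? (yes/no) " first
        [ "$first" = "yes" ] || { echo "Nothing abandoned."; exit 0; }

        read -r -p "Keep the current job after all? (yes/no) " second
        [ "$second" = "no" ] || { echo "Nothing abandoned."; exit 0; }

        echo "Job abandoned."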

  9. Daedalus

    Wur doomed

    The real Y2K problem was that in the year 2000 technology got big enough that there would never be enough wise people to look after it.

  10. Anonymous Coward
    Anonymous Coward

    SELECT * FROM EC3_Instance THEN DROP ALL$

    Beware the wildcard!

  11. Anonymous Coward
    Anonymous Coward

    Availability Zones

    What Amazon left out, and what El Reg didn't mention in their article 12 hours ago, is Availability Zones. You're not supposed to have to go multi-region in order to be able to survive a major AWS outage. Being in multiple AZs is supposed to allow you to survive a fat finger by an AWS employee.

    The fact that Amazon's statement talks so casually about US-EAST-1 S3 makes it clear that there is no segmentation of S3 between AZs. If S3 isn't segmented that probably means other AWS services aren't either. Paid extra for multi-AZ RDS? Added extra EC2 instances for multi-AZ load balancing? It won't help at all if RDS and ELB are administered at the regional level anyway.

    I think Amazon has some splaining to do. If their own services aren't redundant across AZs then what is the point of customers paying extra to be in multiple AZs? Is the only independent component of AZs the power source? That is a far cry from Amazon's selling points of multiple AZs.

    1. diodesign (Written by Reg staff) Silver badge

      Re: Availability Zones

      We didn't mention AZs because S3 doesn't use availability zones. That's for EC2.

      C.

      1. Anonymous Coward
        Anonymous Coward

        Re: Availability Zones

        > We didn't mention AZs because S3 doesn't use availability zones. That's for EC2.

        Pretty much every service uses AZs except for S3. RDS, EBS, EFS, Elasticache, ELB. Maybe S3 doesn't because it was one of their original services. But it's worth asking why they haven't upgraded it yet. If they had, most sites that were affected by the outage would probably have been fine.

    2. jamesb2147

      Re: Availability Zones

      Also they're still physically the same datacenter, so susceptible to combinations of backhoes, bad weather, and poorly performing power cutover systems, etc.

      Using only one AWS region is a bad idea. Period. In fact, I'd argue (thanks, BGP hijacking!) that using only Amazon services is a bad idea. If that is too difficult to manage for you, then set the appropriate expectations with your business managers and users. Your product is too cheap to support that high of an uptime requirement.

      Amazon fails sometimes, Google fails sometimes, Microsoft fails sometimes (and in at least one instance took weeks to restore!)... don't put all your eggs in one basket, people. Don't be that guy.

      This whole fiasco is probably a good example of why developers should not be put in charge of the IT systems, no matter how "easy" they are... Operations teams tend to focus like a laser on uptime and stability, while developers are more interested in maximizing new features.

      1. Doctor Syntax Silver badge

        Re: Availability Zones

        "If that is too difficult to manage for you, then set the appropriate expectations with your business managers and users. Your product is too cheap to support that high of an uptime requirement."

        We keep hearing people saying things like this. And we have to keep replying that marketing has set inappropriate expectations with these very people, who are the ones who make the decisions. They've been told that the cloud (someone else's computer) is cheap and that it's resilient.

        "This whole fiasco is probably a good example of why developers should not be put in charge of the IT systems"

        To some extent I take exception to this. Back in the day it was possible to be in charge of development and operation and be paranoid about stability and uptime. It encouraged not developing what you knew you couldn't run. Times have changed and not, I think, for the better.

        But some cloud (someone else's computer) usage is shadow IT, paid for with a company credit card by people who don't see the need for all the costs and time needed for the detailed stuff which enables in-house developers and operations to combine to provide reliable systems. Don't assume either real developers or operations get anywhere near such deployments. Again, sales and marketing by providers have to take some responsibility here.

        And whilst you're extolling operations, don't forget it seems to have been Amazon's operations staff who grew fat fingers in this instance.

      2. Haberdashist

        Re: Availability Zones

        > Also they're still physically the same datacenter

        No, each region is made of many data centers. US-EAST-1 is spread across Northern Virginia.

        > Using only one AWS region is a bad idea. Period. In fact, I'd argue (thanks, BGP hijacking!) that using only Amazon services is a bad idea. If that is too difficult to manage for you, then set the appropriate expectations with your business managers and users. Your product is too cheap to support that high of an uptime requirement.

        >

        > Amazon fails sometimes, Google fails sometimes, Microsoft fails sometimes (and in at least one instance took weeks to restore!)... don't put all your eggs in one basket, people. Don't be that guy.

        Have fun living on your planet where everyone has the budget and time for multi-provider multi-region setups. It's one thing to chide people for not having proper backups or never considering HA, but expecting every site to launch their own satellite to maintain continuity in case the internet fails is pretty pointless.

        1. dancres

          Re: Availability Zones

          Those that don't have the budget presumably are spending it on features? That's not about cost, that's about where one believes the revenue is, i.e. features. However, if you're down, your features don't get used. A similar argument can be made for time expended in building HA: you can expend engineering effort once, or support and admin effort every time you're down.

          Ultimately, this is about your users. Do you care enough about them to put their fate and yours in another's hands or do you choose to use the available facilities (and if you used a DR style arrangement you could save much of the infrastructure cost until time of need, magic of elasticity) to protect everyone?

          No doubt, for a fledgling company the choice has to be features, but it should be a knowing choice. Amazon make it clear what needs doing for HA; choosing not to do it is on the respective business owner. For those with a decent paying user base the balance is somewhat different, all about how much you value your reputation. Blaming Amazon for your downfall will be limited consolation for your users. If you fall victim often enough you'll be paying the cost in lost revenue through inaction and support interactions. Alternatively you can pay the cost of moving clouds or developing your HA options.

  12. Anonymous Coward
    Anonymous Coward

    Isn't puppet , chef , and Jenkins .. CI/CD .... devops

    Supposed to cure this type of HUMAN fkck up?

    1. zanshin

      Re: Isn't puppet , chef , and Jenkins .. CI/CD .... devops

      "Suppose to cure this type of HUMAN fkck ups ?"

      In a word, no.

      Those tools and the processes they support are for automated testing of changes you plan to roll out, and automated deployment of those changes, hopefully after someone or something has approved them. They make replication of change across many environments simple, including setup of servers, environments and so forth.

      The people in question were carrying out triage on a production performance issue. "Infrastructure as code" isn't really that helpful during triage. You usually have to dive in and run commands by hand. In such a situation, if what you are trying to resolve is related to production load and scale, you probably cannot replicate it on-demand in a test environment, even if you'd like to. That, in turn, can mean you can't really usefully test the command you plan to run.

      Given the nature of AWS/S3, I'm quite sure the command line entered did something heavily automated at scale, and might well have been executed with their equivalent of something like Chef, but *what* it was told to do was likely derived from the triage efforts. You can bork your production environment just fabulously with the wrong command inputs to a tool like Chef. It will dutifully obey you if the command you give it is legit. (They mention that they will change their definition of what's legit based on this experience.)

      I certainly do run what I perceive as "dangerous" commands in test environments before I run them in production, just to make sure I got them right. I can then copy-paste them exactly from dev into prod, at least where the command will be identical in either environment. But if I don't think the command is dangerous, possibly just because I've become used to running it without failure, I could conceivably type it out in full confidence and still screw it up. Triple-checking yourself before you hit "enter" is a matter of experience and, too often, not being over-tired or in a rush.

      1. fredesmite
        Mushroom

        Re: Isn't puppet , chef , and Jenkins .. CI/CD .... devops

        ...

        Certainly Agile babysitting with a story board of post-its on a whiteboard would have prevented it ....

    2. Anonymous Coward
      Anonymous Coward

      Re: Isn't puppet , chef , and Jenkins .. CI/CD .... devops

      No. As long as someone uses a CLI he or she *will* make mistakes. Especially when the switches/parameters you have to set have a man page that looks like the Encyclopedia Britannica, and the average command line is just a little shorter than "The Rime of the Ancient Mariner".

    3. 1Rafayal

      Re: Isn't puppet , chef , and Jenkins .. CI/CD .... devops

      Hmm, no, it isn't.

      DevOps is intended to support developers. Clue is in the name.

      1. Anonymous Coward
        Anonymous Coward

        Re: Isn't puppet , chef , and Jenkins .. CI/CD .... devops

        You are clueless - dev -> test -> production... repeat. The CI model.

  13. Anonymous Coward
    Anonymous Coward

    This is why spending money to go beyond 4 9s is generally wasted

    Unless you have ironclad procedures (which would include prepackaged scripts to do all such tasks, so command-line access is available to virtually no one) you'll lose your 5th 9 due to human error, 9 times out of 10.

  14. Mage Silver badge

    Wizards know

    1 in a million miracles happen 9 times out of 10.

    Or something.

    Next time it will be a rushed-to-release patch that is auto-updated. Perhaps, like HP toner or ink cartridge DRM, it won't be obvious till later.

    Beware potato based Cloud computing.

  15. Anonymous Coward
    Anonymous Coward

    Oops

    I work for a large bank. I once took all of the ATMs off the air by entering a simple command to empty a load library on the mainframe. I was asked to do it by the application expert coz he knew what he was talking about and I had the access.

    Oops. Came half a bee's dick from losing my job.

  16. Alan W. Rateliff, II

    -Confirm:$false

    Not always your friend.

  17. Brenda McViking

    I remain impressed

    By the ability of Amazon to do a root cause so quickly and go public with it.

    In most corporates I've worked with it would take them at least 3 months to figure this out, even with C-Suite backing, and they'd only admit it 2 years later, because lawyers or something.

    It's a breath of fresh air that they have kept us informed. Unlike, say, every bank ever, or TalkTalk, or Adobe. Although naturally they've used up 5 years of their standard 99.99% availability quota in a single day, so I'm by no means advocating they get supplier of the year... just that others might learn that this is the proper way to keep users informed after a crisis.

    1. Broooooose

      Re: I remain impressed

      "By the ability of amazon to do a route cause so quickly and go public with it"

      If you're a CIO and the whole company is betting on your strategy, and you choose to go with a provider who fails and it takes them 3 months to figure out and report on that failure, then you're gonna be asked to move off it pretty quickly and your credibility goes down the swanny.

      If AWS want to maintain their leadership and gain their customers' trust, they have to be transparent and quick to resolve. And yes, it is impressive. But I don't think they get a choice.

  18. Anonymous Coward
    Anonymous Coward

    Wait! They want me to use their automated tools..

    ..while they're doing stuff manually? hmmm.. what'd I miss?

  19. smartypants

    I fixed it!

    Without lifting a finger.

    This cloud stuff is brilliant!

  20. Joe H.

    The website is down dude...

    The outage reminded me of this,

    https://www.youtube.com/watch?v=W8_Kfjo3VjU

  21. Potemkine Silver badge
    Joke

    Pity for the poor SOB

    [rumor] I heard someone in IT was shipped off to Amazon's warehouse in Northern Alaska to wrap toothpicks 8 hours a day [/rumor]

  22. EnviableOne

    Cloud servers - Other People's Tin (OPT) - are just as likely to go down as your own, due to Layer 8 errors.

    The problem is their marketing machine says "don't worry, if some idiot does it, we've got some more Tin over here that we'll move your stuff to."

    But as this demonstrates, no one told ops that.

  23. DaddyHoggy

    "You're about to break the Internet. Are you sure? Y/N"

    Y

    "No, seriously. This will expose Cloud Based solutions as the delicate soap bubble it is. Are you sure? Y/N"

    Y

    "Sigh... OK..."

  24. rh587 Silver badge
    FAIL

    Could have been worse

    Who remembers this epic typo in 2014?

  25. Anonymous Coward
    Anonymous Coward

    Oh well, take learnings, move on, this kind of thing happens every day in companies all over the world (though hopefully only once per mistake.) It's only more noticeable because of scale.

  26. wyatt
    Facepalm

    They're not the first and won't be the last. I'm guilty of taking servers down instead of workstations due to 1 character being different; the knock-on effect can be massive. Limiting this is essential, along with recovering from it.

    1. Locky

      Me too

      Once powershelled the entire company to have an out of office saying "I have now left the company"

      That was the day I learned that with one specific get-mailbox filter, if it finds zero results, it selects all mailboxes. Thanks for that, M$

  27. DropBear
    Trollface

    "...limiting the ability its debugging tools have to take multiple subsystems offline"

    ...from now on, they'll need to use "sudo".

  28. HurdImpropriety

    Move everything to the cloud yet have a single point of failure...nice.

    "Those two subsystems handled the indexing for objects stored on S3 and the allocation of new storage instances. Without these two systems operating, Amazon said it was unable to handle any customer requests for S3 itself, or those from services like EC2 and Lambda functions connected to S3."

    Move everything to the cloud yet have a single point of failure...nice.

  29. Bowlers
    Facepalm

    I wonder

    I wonder how long before the fat-fingered one feels confident enough to report this to El Reg's On Call?

  30. TeeCee Gold badge
    Facepalm

    Hmm. "Playbook".

    That'll be the source of the problem. When such is in use, it can only mean one thing.

    The person typing the commands, while "authorised" to do so, almost certainly hasn't got a clue what they actually do.

    If they did then a) they wouldn't need someone else to have written it down for them and, more importantly, b) they'd have spotted the typo before hitting Enter.

    1. fredesmite

      Re: Hmm. "Playbook".

      So airline pilots should skip the pre-flight checklist ... because only rookies would need that.

  31. Anonymous Coward
    Anonymous Coward

    If we give a human the power to destroy at some point they will do it - usually by accident

    First, an admission: I have performed too many "fat finger" events over my lifetime. Back in the day, when most of this was new, it was accepted. Technology and its intricacies have changed dramatically, and leaving the mere mortal with the capabilities of unrestricted command lines leaves us open to the next fat-finger event.

    Solutions to the problem have existed for a long time under many names; currently it is called "service orchestration", but companies still need to invest in it and technicians have to embrace it. Simply put, it allows any technology's command facilities to be exposed, but in a controlled state, removing the potential for the next "fat finger". For any company where the technology is the company there is no excuse for blaming the human - it is time for the company to own up and say: we didn't support the human, we provided the capability for this to happen, and, as it always will, it happened.

    So to Amazon, and any other high-tech company providing critical services, I would ask: what are you doing to make certain this never happens again? Get the cheque book out and remove the potential for it to happen in the first place.

  32. russmichaels

    Good on them for being honest about the cause and not trying to blag everyone. These things happen.

  33. Anonymous Coward
    Anonymous Coward

    All I can say is...

    Looks like some companies are going to reconsider cloud and bring back their data on-prem. Good day to be a salesman =p

  34. Anonymous Coward
    Anonymous Coward

    /dev/sda1 has gone 2323 days without being checked, check forced

    BTST. Alternative good times.

    These systems have probably been around since Amazon first needed a storage system. And never mind that the variables in the code all refer to books. That's historical legacy code; the original developers have left and you'd better not touch it.

  35. Anonymous Coward
    Anonymous Coward

    The internet *is* "someone else's computers"

    Sorry, I really get annoyed by this "other people's tin" quip.

    You cannot run Internet services on your own computers alone. If a major ISP mucks up their routers and goes dark, your precious customers will not be able to see your website, no matter how much disaster recovery you planned for.

    AWS is a "cloud" hardware store. They stock some nice tools and might offer advice on your design questions. You're still responsible for building a stable system yourself.

  36. quxinot

    It suddenly occurs to me the primary issue with cloud stuff.

    Cloud is way cheaper than doing it on site. It's also comically, laughably more expensive than having it under your own roof. The differences are pretty simple: Doing it right is significantly more expensive than doing it wrong--to the point that the tools you're using aren't important. If your cloud stuff goes down because you don't have geographic failover redundant whatever etc because you cheaped out, you did it wrong. Holds precisely as true as the day the roof leaks and your rack emits a sad little pfzzt sound.

    I do wish I lived in a world, or even a location, where this magical internet of perfect connectivity at wonderful speed was available, though. The last thing I want is any important data or processing being done on the other end of an unreliable, tiny, crooked straw. Someday, someday...

  37. This post has been deleted by its author

  38. Anonymous Coward
    Anonymous Coward

    Put it in a shell script

    At our company, the rule was to put anything potentially bad in a shell script, show it to someone else first, and no $* in the script.
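
    A minimal sketch of that rule in practice - targets written explicitly into the script so the reviewer sees exactly what it will touch, no arguments and no $* at run time (the host and service names are made up for illustration):

    #!/usr/bin/env bash
    # drain-billing-hosts-2017-03-02.sh - reviewed by a second pair of eyes before running.
    # Targets are hard-coded on purpose; the script takes no arguments.
    set -euo pipefail

    hosts=(
        billing-idx-017
        billing-idx-018
    )

    for h in "${hosts[@]}"; do
        echo "Draining $h"
        ssh "$h" 'sudo systemctl stop example-indexer'
    done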

  39. Anonymous Coward
    Anonymous Coward

    Was it an outsourced sysadmin from India who did this?

  40. Anonymous Coward
    Anonymous Coward

    Those who haven't done:

    [root] : rm -rf /*

    when they meant rm -rf ./*

    ...have not really experienced life as a software guru.

  41. Anonymous Coward
    Anonymous Coward

    FLUTTER OF BUTTERFLY WINGS

    How's that go - a slight disruption of air over here leads to a hurricane over there...?

    It's all connected, folks, so get ready for when it all comes crashing down..
