Tuesday's AWS S3-izure exposes Amazon-sized internet bottleneck

Amazon’s S3 outage is a gift to Azure and Google, on-premises IT, hybrid cloud supporters and multi-cloud gateways. But it has also exposed inadequate business continuance and disaster recovery provisions by Amazon's business customers. All of them can point the finger at Jeff Bezos and say AWS let users down. And now we know …

  1. Gordon Pryra

    Lies and Urban Myths

    Business Continuity and Disaster Recovery don't exist. It's an urban myth.

    I have seen companies spending hundreds of thousands on DR for DEVELOPMENT systems that would NEVER have worked. Yet in 20 years I have seen about 2 companies who had a REAL WORKING disaster recovery system in place.

    Anyone who tells you that their company has any form of real BC beyond a few backup tapes and some telephones for the sales weasels is just lying to protect their job, and that job is ticking things off on a checklist and passing reassurances up the line so the people above them can tick the same things on a different bit of paper.

    1. Anonymous Coward
      Anonymous Coward

      Re: Lies and Urban Myths

      A previous employer tested their DR a couple of times a year. Sometimes they'd switch to their backup office location and make sure everyone could still work OK. Other times they'd switch over to use the backup datacentre.

      I'm fairly sure they did the datacentre switch in the middle of the day. While there may have been some sweaty palms as they flicked the switch, it was worth it from the point of view of proving business continuity.

      They treated it seriously, and did not hold back from improving things if someone spotted a flaw.

      Having said that, it is the only place I've seen it done anywhere near properly.

      1. Anonymous Coward
        Anonymous Coward

        Re: Lies and Urban Myths

        Having said that, it is the only place I've seen it done anywhere near properly.

        I have multiple customers with working DR sites that are used just like that, in the hotel, finance and oil industries, and in government. They run drills, and switch over regularly to test. We get an ear-bashing if anything goes wrong during such tests (which, fortunately, isn't too often).

        By and large the people who think they have BCDR solutions, but don't, are the ones who buy (or are sold) the equipment & then phone up consulting services to say "I've bought your XXX, how do I configure it to protect my company?"

        The ones who do have working setups are those who employ specialists to do business analysis and risk assessments, and then buy the right kit for the job. It's not a cheap process, so it tends to be the big companies, who do understand the problem. It would appear that Amazon is not on that list.

        1. Anonymous Coward
          Anonymous Coward

          Re: Lies and Urban Myths

          Amazon absolutely are on that list - they provide many different data centers for failover and continuity, and multiple tools to enable it. Yes, it's more expensive to deploy this way and, yes, it's more complicated. The blame for the failures of applications using AWS is 100% on the customers who cheaped out and didn't build resiliency into their deployment architectures.

          Amazon have some lessons to learn about their own status and notification tools and making sure they operate during an outage, but overall this has been a valuable lesson in identifying those companies that are too stupid or miserly to use the proper tools available to them. Plus some truly shocking deployment decisions - web fonts stored on S3? Config files for a mouse? Complete failure of security systems when one AWS node goes down? FFS, are all your developers 19?
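
          (For readers wondering what those "multiple tools" look like in practice, below is a minimal, hypothetical sketch of S3 cross-region replication using boto3. The bucket names and IAM role ARN are placeholders, and both buckets need versioning enabled before replication can be configured; this illustrates the mechanism rather than any particular customer's setup.)

          ```python
          # Minimal sketch: S3 cross-region replication via boto3.
          # Bucket names and the IAM role ARN are hypothetical placeholders.
          import boto3

          # Replication requires versioning on both the source and destination buckets.
          for bucket, region in [("example-primary", "us-east-1"), ("example-replica", "us-west-2")]:
              boto3.client("s3", region_name=region).put_bucket_versioning(
                  Bucket=bucket,
                  VersioningConfiguration={"Status": "Enabled"},
              )

          # Replicate every new object written to the primary bucket into the replica region.
          boto3.client("s3", region_name="us-east-1").put_bucket_replication(
              Bucket="example-primary",
              ReplicationConfiguration={
                  "Role": "arn:aws:iam::123456789012:role/example-replication-role",  # hypothetical role
                  "Rules": [
                      {
                          "ID": "replicate-everything",
                          "Prefix": "",  # empty prefix = all objects
                          "Status": "Enabled",
                          "Destination": {"Bucket": "arn:aws:s3:::example-replica"},
                      }
                  ],
              },
          )
          ```

          Replication only copies new writes, of course; whether the application can actually read from the replica when the primary region is having a bad day is the part customers still have to design and test themselves.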

      2. Anonymous Coward
        Anonymous Coward

        Re: Lies and Urban Myths

        My previous employer had 2 data centres 25 miles apart. I built active/passive clusters with data mirrored at both sites using Veritas Cluster Server. DWDM between sites providing SAN and WAN bandwidth so we could replicate and heartbeat.

        DR testing was a simple matter of failing over the cluster. We could run from either site and swapped between the two on a regular basis to prove DR worked. This was over 10 years ago. The same infrastructure also allowed us to run backups from both the passive and active sites, giving us instant on-site and off-site backups.

        The solutions these days seem to frown on a high-cost but highly resilient approach like this. Sadly this applies to the systems replacing the ones I put in: nowhere near as resilient, and no chance of running from either site to prove DR.

        1. Anonymous Coward
          Anonymous Coward

          Re: Lies and Urban Myths

          Maybe it's because I've worked so much in New Zealand, but we always design and test DR and our clients there take this seriously. When you live in an active earthquake, volcano, and tsunami zone you tend to be more aware that "disasters" happen pretty regularly and plan for them.

          I wish I could say I'm shocked and surprised by the complete lack of awareness of BC and DR planning and testing among many businesses outside that region. But then I remember a conversation with a CTO where he told me that VMware was their DR solution since they could just reload the server images from storage (despite there being no offsite replication, no tape backups and rotation, no external power backup, etc. etc.)

        2. Aitor 1

          Re: Lies and Urban Myths

          If DR cannot be proved, I would say it is proved... not to work.

    2. Anonymous Coward
      Anonymous Coward

      Re: Lies and Urban Myths

      People who want to know at a techyish rather than marketing level about what goes on in BC and DR done properly, and about what interesting things can go wrong when BC/DR related design is not done properly, might want to have a look at obscure but informative and educational websites such as the Availability Digest (other suggestions welcome). AD doesn't always mean Active Directory, even if that's wot Gordon the Firefighter thinks.

      http://availabilitydigest.com/

      Yes I know it doesn't look like the kind of website that your typical Presentation Layer Person would approve of. Sometimes focusing on content and structure, rather than presentation, can be a good thing, even in 2017.

      1. Locky

        Re: Lies and Urban Myths

        Obligatory Dilbert

        It's only funny because it's true...

      2. Gordon Pryra

        Re: Lies and Urban Myths

        Not only a Firefighter, but well versed in working with the NHS, and even some brief encounters with the Police, thank you very much! (I doubt many people will get that reference though, AC)

        Yes yes, there are ALWAYS some exceptions to the rule; it's just a pity the actually important systems owned by institutions that actually matter are never those exceptions.

        Who gives a monkey's if some consultancy or private company is well looked after by its staff? I can state with some certainty that the infrastructure running this country is not.

    3. People's Poet

      Re: Lies and Urban Myths

      Nah.... you're just talking a load of bollocks.

  2. regregular

    Amazon should shut down datacenters on a rotational basis every day of the week until the duplication message has been well massaged in.

    And maybe we should add a cloud-free Monday to our schedules as well. It is simply baffling why a mouse should give you a headache just because the control app can't get a connection, or why you can't turn on your damn Philips Hue bulb without a connection to their servers.

    The cloud ain't bad, but many of the developers who fabricate stuff like that are. The benefits and limitations have to be understood by developers and manufacturers. There needs to be a mandatory IoT firmware/control-app QA step that simulates an internet outage and checks whether the software drops a bollock, and this needs to happen before something really critical borks out.

    1. Arthur the cat Silver badge

      Amazon should shut down datacenters on a rotational basis every day of the week until the duplication message has been well massaged in.

      The problem with that is that the lazy would simply move to another cloud provider that didn't do that, and then whinge when that one had a failure that bit their arse.

      If you're a big enough player you really should be using tools like Netflix's Chaos Monkey/Gorilla/Kong trio to prove you're truly resilient. If you're not a big player, the truth is that AWS is probably not for you. From day one, AWS documentation has always warned that you should be prepared to handle outages.
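
      (Netflix's tooling is a whole platform, but the core idea is small enough to sketch. The snippet below is a toy chaos drill in the spirit of Chaos Monkey, not Netflix's actual code: pick a random healthy instance in an Auto Scaling group, terminate it, and see whether anyone notices. The group name and region are hypothetical.)

      ```python
      # Toy chaos drill in the spirit of Chaos Monkey (not Netflix's actual tool):
      # terminate a random in-service instance and let the Auto Scaling group prove it recovers.
      import random
      import boto3

      ASG_NAME = "example-web-asg"  # hypothetical Auto Scaling group

      autoscaling = boto3.client("autoscaling", region_name="us-east-1")
      ec2 = boto3.client("ec2", region_name="us-east-1")

      group = autoscaling.describe_auto_scaling_groups(
          AutoScalingGroupNames=[ASG_NAME]
      )["AutoScalingGroups"][0]

      in_service = [i["InstanceId"] for i in group["Instances"] if i["LifecycleState"] == "InService"]
      if in_service:
          victim = random.choice(in_service)
          print(f"Chaos drill: terminating {victim} from {ASG_NAME}")
          # The group should launch a replacement on its own; the real test is
          # whether users see any errors while that happens.
          ec2.terminate_instances(InstanceIds=[victim])
      ```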

    2. Voland's right hand Silver badge

      Will not help in this case

      1. It was not a full blackout - it was a brownout. You cannot emulate this by shutting down a datacenter.

      2. The effects on customers are a "Cloudified" (ughh... language vandalism) version of an outage in a centralized SAN/NAS design. Put all of your eggs into one basket and you get the expected result. Whether the basket is called Amazon, EMC, Pure, Tintri or "FairyDust Storage" does not matter; the results are all the same. You cannot fully counter this by having redundant storage, you have to be redundant by design - something you will never be if you go for a centralized single-storage design (there are just too many ways to nuke that). The question is different: is it worth it, for what you are doing, to have centralized storage? You do a cost/benefit analysis versus distributed, and if that shows you can survive an occasional outage commercially, you leave the users to scream and shout.

    3. richardcox13

      > Amazon should shut down datacenters on a rotational basis every day of the week until

      > the duplication message has been well massaged in.

      Which would also penalise those use cases where a few hours of downtime is not a problem. Not everything needs 24x7 uptime; there are plenty of cases where a 6-hour outage is not a problem but 24 hours would be.

      No cloud provider says you get DR without some work at the client end; equally, non-critical use cases shouldn't be blocked.

  3. jMcPhee

    But, the primary point of 'the cloud' is to be a cheapskate

    1. 1Rafayal

      Guess you have never tried to run your infrastructure in the cloud then...

  4. LeeAlexander

    In an ideal world

    Yes, in an ideal world where the grass is always green and the Sun shines 24/7 we would all have "hybrid" cloud solutions... BUT you are talking about doubling your cloud bill, not to mention the development costs of *trying* to create, test and run "hybrid" apps.

    Dream on...

    1. xeroks

      Re: In an ideal world

      but the point is... your cloud solution has not been costed correctly if it doesn't have that DR element.

      While your accountants love cheap cloud while it's all working, they won't if revenues start going down the drain.

      Essentially you're placing a bet that it all keeps working. Depending on your business, that might be a risk you can deal with.

      1. Anonymous Coward
        Anonymous Coward

        Re: In an ideal world

        "While your accountants love cheap cloud it while it's all working, they won't if revenues start going down the drain.

        Essentially you're placing a bet that it all keeps working. Depending on your business, that might be a risk you can deal with."

        Some businesses might be able to deal with it, others might be less lucky when things start having "high error rates".

        In many cases it's not the service provider that picks up the financial impact when the service provider's stuff is not working right, it's the end customer (often one whose non-technical people had mistakenly believed the cloud hype).

        When (if) the service providers are liable for the financial impact of their "high error rates", some behaviours (and products) might change.

        When do you think that might happen with an external (3rd party) mass market service provider?

        1. Jellied Eel Silver badge

          Re: In an ideal world

          Business owners need to understand the cost of failure: direct costs from lost productivity, plus reputational damage. Unless businesses can state how much a protracted outage will cost them, they can't sensibly decide whether a DR solution is 'too expensive'. Sensible business owners understand this, and work with suppliers to design solutions that meet the operational requirements. Less sensible ones buy DSL connections to save costs, then wonder why their office/store/factory/warehouse connections fail.

          Sensible business owners also understand that there might be... limitations with low-cost cloud solutions, especially if they're 'one size fits all', which can make integrating DR a lot harder, especially if there's software that may not play nicely in a virtualised/containerised world. But it is possible: one client decided to go with a private cloud solution that allowed synchronous replication between two widely separated data centres, which made it pretty awesome demonstrating DR invocation to the client and having them not notice the cutover.

        2. Anonymous Coward
          Anonymous Coward

          Re: In an ideal world

          When (if) the service providers are liable for the financial impact of their "high error rates", some behaviours (and products) might change.

          But they won't. Because the contract you sign explicitly absolves them of any responsibility whatsoever. And if you try to push them to accept some accountability, they refuse to sign the contract. They're not stupid!

  5. SaleNowOn
    Meh

    Does it really matter?

      Granted, for the really important stuff you want DR sites and duplicated data centres, but for companies like Imgur, does the loss of 20 terabytes' worth of cat pictures in Superman outfits warrant the implementation of a full DR solution?

    1. Anonymous Coward
      Anonymous Coward

      Re: Does it really matter?

      Dude...some of the smaller affected sites might have been hosting porn. OF COURSE, a full DR solution is warranted.

  6. A Non e-mouse Silver badge

    Optional DR/Resiliency

    All this talk about cloud costs going up to cater for resiliency isn't new. It applies to your on-premises systems too.

    If you think that DR, resiliency, etc. are too expensive, that's fine. That's your value judgement. But you have to assess the impact of an outage on your organisation/business and make an informed decision.

    Just don't blame Amazon, Azure, etc. if you decided not to buy resiliency and their systems go off-line.

    1. Lee D Silver badge

      Re: Optional DR/Resiliency

      The number of times that I've had to explain this:

      If you want a backup system, it will cost you what the real system cost, again, and a bit more for whatever tech to make it fall over.

      And, yes, that functionality, hardware, processing power, storage, etc. will NOT be available to you to use. It will literally be idle (from a user point of view, but hopefully replicating etc.!) most of the time.

      If you want something that tolerates a failure, you have to buy two of them and one of them does nothing all day long but wears, depreciates and costs just as much as the first. If not, it's not a suitable replacement.

      And then you get into the depth you take this to - a redundant disk is just another disk. A redundant array is just another array. A redundant server is just another server. But a redundant site is another site. A redundant datacentre is another, fully-funded, fully-functional, datacentre. That sits and does nothing but can break in exactly the same kinds of ways over time.

      And then you have to have a controller card, or another storage array, or licensing for the server and software to make it fail over, and site-failover logic and hardware, etc. on top of that cost.

      I'm currently working at a place that can put a value on their data. They very nearly lost everything, and it would have cost an awful lot to get back running, let alone try and get their data back. Thus their DR is "proper", as they realised how much it would have cost in time and money, realised how much it would cost to avoid that (including my salary, for instance) and chose the "good" side of the coin.

      As such, despite being a tiny employer by global standards, we have remote sites, remote servers, remote backups, full remote operation in an emergency, redundant leased-lines, redundant cabling around the site, redundant servers and all the logic to tie this together nicely.

      But securing System A against failure requires System A and System B of the same spec - MINIMUM - sensibly System C and maybe System D as well, plus the additional licensing and logic to fail them over, and complete copies of EVERYTHING on them all. So you would have to pay 2-5x the total price your system cost originally, just to do a basic job of it.

      When you do the maths, that STILL works out better than data loss, however. But nobody ever costs data loss properly until it happens and they realise how much it REALLY costs in terms of lost custom, legal requirements, hassle, time and money, the complete INABILITY to recover some data (no, you can't just post it off to 'a specialist' and expect anything to come back except a bill), etc.
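
      (To put rough shape on "do the maths", here is a back-of-the-envelope comparison with entirely made-up figures; the structure of the calculation is the point, not the numbers, which every business has to supply for itself.)

      ```python
      # Back-of-the-envelope DR cost-benefit sketch. All figures are hypothetical.
      def expected_annual_loss(cost_per_hour, hours_per_outage, outages_per_year):
          """Expected yearly cost of outages under a given level of protection."""
          return cost_per_hour * hours_per_outage * outages_per_year

      # Example: losing £20,000/hour, expecting one 8-hour outage every two years.
      do_nothing = expected_annual_loss(20_000, 8, 0.5)      # about £80,000/year expected loss

      # A DR setup costing roughly 2x the primary system, amortised with licences and staff,
      # which shrinks a typical outage to about an hour of failover time.
      dr_annual_cost = 60_000                                 # hypothetical annual DR spend
      residual_loss = expected_annual_loss(20_000, 1, 0.5)    # about £10,000/year residual loss

      print(f"Do nothing: ~£{do_nothing:,.0f}/year expected loss")
      print(f"With DR:    ~£{dr_annual_cost + residual_loss:,.0f}/year (spend + residual loss)")
      # With these made-up numbers DR wins comfortably - but only real figures can decide.
      ```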

      1. Doctor Syntax Silver badge

        Re: Optional DR/Resiliency

        "If you want a backup system, it will cost you what the real system cost, again, and a bit more for whatever tech to make it fall over."

        There are different degrees of resilience you might need to provide. What you describe is full-fat failover. However, it might be satisfactory to have a DR contract, which is much cheaper - you get access to a system, possibly bigger than your own, on which to restore your backups. If your DC is on your normal premises, the time to get the DR contract invoked and running might be of the same order as the time taken to get your staff established in whatever temporary offices they've made provision for. OTOH this wouldn't suit a business which has to be running 24/7. But one size doesn't fit all.

      2. Anonymous Coward
        Anonymous Coward

        Re: Optional DR/Resiliency

        So there's absolutely no way no how whatsoever that two systems on two sites can share workload in an active/active manner, not necessarily two (or more) identical systems, and if one site goes down the design can be such that only critical parts of the workload are continued on the remaining site(s), as many as the surviving site can cope with, thus continuing to offer some kind of support for critical activities while not requiring total dual-site replication of hardware (and software)?

        It was, and is, entirely possible to do that kind of thing. You can often make a variety of different cost vs availability tradeoffs, depending on what the organisation wants. But it may require a little thinking outside the box, from customers and their IT providers. On the other hand, hype is simple, and hype sells, and by the time it needs to deliver, the vendor IPO is long gone and the salesmen have moved on.

        1. Phil O'Sophical Silver badge

          Re: Optional DR/Resiliency

          So there's absolutely no way no how whatsoever that two systems on two sites can share workload in an active/active manner, not necessarily two (or more) identical systems, and if one site goes down the design can be such that only critical parts of the workload are continued on the remaining site(s),

          Of course you can, but then you don't have 100% resilience; you only have resilience for those parts of the infrastructure you deem to be critical. This is the whole principle of Business Continuity Management. Decide what's important, decide what risks you need to protect against, work out what it will cost, and do the cost-benefit analysis. If you need 100% resilience, you need a duplicate site. I know of companies where their DR site is used for report generation, background processing, and other non-critical work. It takes the load off the primary, on the understanding that in a crisis that low-priority stuff gets canned so that the DR site becomes Production.
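
          (The "low-priority stuff gets canned" idea is easy to sketch: rank workloads, and when a site is lost, keep only what the surviving capacity can carry, most critical first. The services, priorities and capacity figures below are invented purely for illustration, not anyone's actual BCM tooling.)

          ```python
          # Toy sketch of priority-based load shedding after losing a site.
          # Services, priorities and capacity figures are invented for illustration.
          from dataclasses import dataclass

          @dataclass
          class Service:
              name: str
              priority: int        # lower number = more critical
              capacity_units: int  # how much of the surviving site it would consume

          services = [
              Service("payments", 1, 30),
              Service("customer-portal", 2, 40),
              Service("reporting", 5, 50),
              Service("batch-analytics", 6, 60),
          ]

          def shed_load(services, surviving_capacity):
              """Keep the most critical services that fit in the capacity left after a site failure."""
              kept, used = [], 0
              for svc in sorted(services, key=lambda s: s.priority):
                  if used + svc.capacity_units <= surviving_capacity:
                      kept.append(svc.name)
                      used += svc.capacity_units
              return kept

          # Only 100 capacity units survive on the DR site once the primary is gone.
          print(shed_load(services, surviving_capacity=100))
          # -> ['payments', 'customer-portal']; reporting and batch-analytics get canned
          #    until the primary site comes back.
          ```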

      3. Anonymous Coward
        Anonymous Coward

        Re: Optional DR/Resiliency

        "If you want a backup system, it will cost you what the real system cost, again, and a bit more for whatever tech to make it fall over."

        False, an outright lie. A backup isn't typically even similar to the primary system; it's a backup.

        It has only the functionality to provide basic functions while the primary system is down, therefore needing a lot less capacity and no reserve capacity or internal redundancy, i.e. it's much cheaper.

        If you meant full redundancy, then you are partially correct: it's a mirror of the primary system.

        "And, yes, that functionality, hardware, processing power, storage, etc. will NOT be available to you to use. It will literally be idle (from a user point of view, but hopefully replicating etc.!) most of the time."

        This is also false, or you just don't do any software development: of course it's available for use, that's trivial.

        A second set of hardware is ideal for any kind of software testing, from performance tests to penetration tests, and you can do all of that without interfering with its main function, redundancy.

        It also has up-to-date production data, very useful for testing. Of course you can't delete or update anything existing in the primary system, but you can have additional data and do whatever you want with it.

        This applies partially to a backup system too: if the software is fast enough on the backup system, it's fast enough on the primary system too.

  7. macjules

    Am I allowed to mention ..

    Oracle Cloud. We believed the blurb that it is 100x faster than RDS/Aurora for MySQL, signed away our life (well, OK, £108 per month for unmetered) and now get more notifications than we can handle. Occasionally tucked away among them is an outage status notification.

    AWS might have hiccups, but the others are certainly no better.

  8. Individual #6/42

    The best resiliency I saw

    was so robust that the director demonstrated it by literally pulling the plugs on the machines one by one and watching the rebalancing to another site.

    I asked what happened if we started at the other end of the farm. We pulled a few plugs and watched as the lauded system fell over and cried (metaphorically).

  9. Mage Silver badge
    Big Brother

    Amazon’s S3 outage is a gift to Azure and Google?

    "Amazon’s S3 outage is a gift to Azure and Google, on-premises IT, hybrid cloud supporters and multi-cloud gateways. But it has also exposed inadequate business continuance and disaster recovery provisions by Amazon's business customers."

    No, it's no gift to Azure and Google or any other Cloud seller. Anyone with any logic will realise that inherently the issue is the same for all.

    This: ->

    exposed inadequate business continuance and disaster recovery provisions

  10. Merrill

    Licenses are a big deterrent to proper BC/DR implementation

    Too many vendors of proprietary software infrastructure charge full freight for deployments on the passive side of active/passive implementations. Even with active/active implementations, there are additional costs for underutilized licenses, and the design and operation is more complex.

  11. Anonymous Coward
    Anonymous Coward

    Have we heard the full story?

    I doubt we have heard the full story so far.

    I was having problems creating load balancers in the US-EAST-1 region on Monday afternoon GMT (provisioning was taking 30 minutes before failing).

    Likewise, when we have found AWS issues in the past, anything that gets published by AWS always appears late, and seriously downplays the issues at hand.

    1. theblackhand

      Re: Have we heard the full story?

      From the various things I have seen so far, the issue appears to be related to the scale of the US-EAST-1 region - it's AWS's biggest. My guess is that's because of the size of the East coast business, but I'm surprised more companies haven't moved to US-EAST-2. Reddit commentards have noted that most of the AWS issues in recent years have centred around US-EAST-1.

      The problem appears to have been triggered by a network issue and resulted in one of the Northern Virginia DCs basically going offline. The migration of workloads to the other two DCs within the region then resulted in "high error rates", which I guess means overloading of the network links or storage bandwidth needed to complete the migrations. If there were any other issues that compounded the problem (e.g. maintenance or faults/outages on inter-DC connections) then it may be a case of "n+1 or more links isn't sufficient to cope with the potential issues we see".

      The post-mortem is due on Monday - should be an interesting read.

  12. Christian Berger

    With your own infrastructure...

    ... you can at least fix stuff when it's broken. With a cloud solution you have to hope that the cloud provider knows what it's doing.

  13. Anonymous Coward
    Anonymous Coward

    Cloud = Eggs in one big basket

    If you are stupid enough to run your business out of a cloud you deserve everything you get.

    Any serious issue with a major cloud provider now brings a massive number of companies to their knees.

    At least when companies run their own data centres in-house, everything is separate and compartmentalised. Companies are just jumping into the Cloud like sheep.

    It's just a matter of time before the first massive DDOS attack on Cloud providers occurs. That'll be interesting.

    1. Will Godfrey Silver badge
      Unhappy

      Re: Cloud = Eggs in one big basket

      My first thought was that maybe this was exactly what happened here.

    2. andyheat

      Re: Cloud = Eggs in one big basket

      "It's just a matter of time before the first massive DDOS attack on Cloud providers occurs. That'll be interesting"

      Like the one against Linode in 2015 you mean, that took most of their data centres down a number of times over the Christmas period?

      http://www.theregister.co.uk/2016/01/04/linode_back_at_last_after_ten_days_of_hell/

  14. Rob Isrob

    The value prop takes a hit

    The problem is in the SMB space. If they have to live in two "replicated" regions, suddenly the reason for going to the cloud in the first place takes a financial hit. Might as well stay local (in many more cases, certainly not all) and continue to send those tapes offsite and hope for the best. The smarmy fella who posted in another El Reg comment that "We just flipped from Northern VA to Frankfurt, took us all of 3 minutes" - well, goody for you. They are a large enterprise and designed appropriately - duh. Same for Netflix: after an embarrassing AWS-induced outage years ago, they spread the love across regions; no problem. The AWS sphincters tightened up quite a bit in the SMB space after yesterday.

  15. jamesivie

    Regional Replication Failed...

    We have all our data replicated to the us-west-2 (Oregon) region and have done successful practice recoveries. We were unable to access that region either. Epic Fail!
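
    (One pattern that goes beyond replication alone is having the application explicitly fall back to a replica region, using region-pinned endpoints, when the primary misbehaves. Whether that would have been enough on the day is exactly the question raised above; the sketch below is hypothetical, with placeholder bucket and key names, and real code would also need retries, backoff and alerting.)

    ```python
    # Minimal sketch of a cross-region S3 read fallback. Bucket and key names are
    # hypothetical; this is an illustration of the pattern, not production code.
    import boto3
    from botocore.exceptions import BotoCoreError, ClientError

    REGIONS = [
        ("us-east-1", "example-data-primary"),  # primary region and bucket
        ("us-west-2", "example-data-replica"),  # replicated copy in Oregon
    ]

    def fetch(key):
        """Try the primary region first, then fall back to the replica region."""
        last_error = None
        for region, bucket in REGIONS:
            s3 = boto3.client("s3", region_name=region)
            try:
                return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            except (BotoCoreError, ClientError) as err:
                last_error = err  # remember the failure and try the next region
        raise last_error

    # data = fetch("config/settings.json")  # hypothetical object key
    ```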

  16. Anonymous Coward
    Facepalm

    Fooled us too

    We only use AWS for nonessential services. Our site works fine even when blocking all 3rd-party crap via uMatrix. I thought we were safe.

    Nevertheless, our whole site was borked by a plugin loading an unused 3rd-party JS file, which WAS being served, but only the first ~1k of it.

  17. Anonymous Coward
    Anonymous Coward

    KILS

    I propose a new approach, Keep It Local, Stupid (KILS), with multi-path communications between the KILS-A and KILS-B sites.

    Putting a valuable company asset, often ignored by suits (SAS, Suits Are Stupid), in the hands of a third party is SSS (Stupid, Stupid, Stupid), but then who believes the IT person?

  18. lamont

    Really this is almost a non-issue to people who aren't perfectionists

    Outages are going to happen. If the only time this year your service crashes is when everyone understands that half the internet is offline because S3 is down, then you've got a ton of air cover on this one with your customers, who are probably themselves struggling with S3 being down.

    Far from being a rationale to get off of Amazon, this is a good rationale to be using Amazon for their services. If you're a small shop you can go the Amazon route and only get hosed when Amazon borks up the whole internet -- or you can try to set out on your own, and then when you inevitably screw it up, because you've got vastly fewer resources than Amazon at your disposal, you have nobody else to blame.

    There's a third option which people think they have, which is that it's simple enough that they'll solve it all themselves, but inevitably the complexity of your software and systems will bite you hard some day or night, no matter how smart you think you are (and likely the fact that you think you should be able to reach perfect operational uptime indicates that you don't understand the uncertainties, and you'll be much less likely to succeed). And something that perfectionist IT people don't understand is that the company that spends enough resources to try to get a flawless operational record will get beaten by the company that spends enough to get by and diverts the freed-up resources to other efforts to capture customers. So if your UI hasn't been updated in 10 years but your uptime is 100% and your backups are flawless, you're probably sinking in the marketplace. And if you had the resources to do it all, you'd probably be a Fortune 500 company like Amazon -- the rest of us have to make business trade-offs.

    1. Anonymous Coward
      Anonymous Coward

      Re: Really this is almost a non-issue to people who aren't perfectionists

      "Re: Really this is almost a non-issue to people who aren't perfectionists"

      Really?

      So loss of revenue is a non-issue?

      You have to work in a very big company to think like that, because for a small company it's definitely a serious issue: the revenue stream is already very thin as it is.

      Five hours means not only lost sales, but customers who won't come back, and that's worse than a temporary loss of sales.

  19. Anonymous Coward
    Anonymous Coward

    Five nines = cloud nein

    This hasn't been a great start to the year for cloud services in general. I'd consider it a wake-up call: use cloud but don't depend on it.

    I can't imagine what might cause a loss of service on this scale, but if I were a paying customer relying on AWS I'd want to find out.

  20. Anonymous Coward
    Anonymous Coward

    Ring Doorbell Home Security impacted as well

    This presumably is what took down RING.COM home security's entire cloud network, INCLUDING their toll-free support phone line. What a shame, the lack of redundancy in this day and age.

  21. druck Silver badge
    Mushroom

    A complete and total meltdown...

    ...was experienced by my three- and one-year-olds when they were told that Amazon Prime Video wasn't working and they couldn't watch Mike the Knight or Paw Patrol.

    1. Phil O'Sophical Silver badge
      Coat

      Re: A complete and total meltdown...

      A suitable DR plan in this instance might be a cupboard full of DVDs?
