AWS admits more bits of its cloud broke as it recovered from DynamoDB debacle

Amazon Web Services has revealed that its efforts to recover from the massive mess at its US-EAST-1 region caused other services to fail. The most recent update to the cloud giant’s service health page opens by recounting how a DNS mess meant services could not reach a DynamoDB API, which led to widespread outages. Down the …

  1. xanadu42
    Facepalm

    Another Wobbly Service...

    And how unexpected is it that there would be a cascade of additional events related to the first?

    1. Anonymous Coward
      Anonymous Coward

      Well AWS recovered after a dozen hours … but customer systems will have needed further time to sort themselves out.

  2. ICL1900-G3 Silver badge

    What a great idea

    To put your business in somebody else's hands, to pay them a premium to manage something you could do yourself, control yourself...

    1. Will Godfrey Silver badge
      Angel

      Re: What a great idea

      Well, it's all about community spirit, doncha know.

      Instead of one company having a wobbly, everyone joins in to offer 'support'

      1. John Brown (no body) Silver badge

        Re: What a great idea

        The "Blitz Spirit"?

        1. eamonn_gaffey

          Re: What a great idea

          More like collective punishment .............

    2. abend0c4 Silver badge

      Re: What a great idea

      The counter argument, that you get a bunch of highly-skilled operations staff working 24/7, a service you could not individually afford, unsurprisingly doesn't seem to have got much of an airing over the last few hours.

      The biggest problem, it seems to me, is not the outsourcing per se (there will be a significant body of customers for whom it makes economic sense in principle if the implementation is right), but the small number of providers. The potential international economic effect of such a huge chunk of infrastructure going out simultaneously is a risk that transcends any one particular customer's interests.

      These vast enterprises must be broken down into smaller units. Not only for resilience, but to encourage competition, necessitate the development of standards that allow services and data to be migrated and to restore the balance of power between consumer and provider - indeed, between nation state and provider. This is a warning that requires an urgent response.

      1. An_Old_Dog Silver badge

        Re: What a great idea

        @abend0c4:

        "The counter argument, that you get a bunch of highly-skilled operations staff working 24/7,"

        I believe many positions were trimmed, leaving behind a much-smaller bunch of less-skilled (cheaper) operations staff working 24/7.

        1. Jim Mitchell
          Holmes

          Re: What a great idea

          I believe many positions were trimmed, leaving behind a much-smaller bunch of less-skilled (cheaper) operations staff working 24/7.

          And if this was in-house, management would have done the same.

        2. Graham Cobb

          Re: What a great idea

          Which is why abend' is right that we need more competition in cloud services so customers can choose their tradeoff between price, speed, reliability, support, etc.

          1. John Brown (no body) Silver badge

            Re: What a great idea

            Breaking them up without some form of legislation only kicks the can down the road. They'll just grow and merge into different incarnations of the same thing again. Just look at the so-called "baby Bells" in the USA and what they are now. A few large conglomerates with more or less defined monopoly regions and very little competition under new names.

      2. Tron Silver badge

        Re: What a great idea

        You are better off with your stuff on prem if you are operating an intranet and never connecting to the public internet, and have good backup policies. It could be armageddon out there, and your intranet would keep on humming on a UPS, back up and then neatly shut down.

        Those who outsource are better off if there is only a small number of services, not lots of services of varying sizes, or national ones (run by governments, spying on everything you do).

        With a few vendors, it is immediately apparent that there is a problem and skilled people can be thrown at it. Plus, credit to Amazon for the transparency.

        If you increase the number of vendors you dilute your chances of getting it fixed (or even noticed, if it happens at 3am). And if it is run by govt, don't expect any transparency.

        It is not unusual for individual services to fall over and for there to be no public recognition that they have for hours, sometimes days. If the big services fall over, everyone notices and fixes begin quickly.

        Anything operated on nation state lines immediately becomes a target, so that is something to avoid.

        1. JimBz

          Re: What a great idea

          In-house apps break too. Small businesses will rarely have the resources or expertise to manage and secure complex systems. They are typically way better off using cloud apps than trying to maintain a bunch of servers.

        2. Anonymous Coward
          Anonymous Coward

          Re: What a great idea

          "And have good back up policies"

          What stuns me is how few businesses have planned how they will work without the Internet. It's so bloody obvious that one day it could all get brought down by either a natural event or war. In the early days of online payment, shops used to keep those manual card readers, but no longer. They rely purely on low-cost Internet services. A few don't even take cash now. Large businesses seem to have given up any hope of survival without the Internet. It's hard to get many to even have validated backups and tested redundancy.

          So where does it leave us? If there is a world war millions may die, even without nukes, not from bullets but starvation if all the Internet links are severed. With EMP nukes it would be billions dying of starvation or fighting for the last crumbs with most electronics gone. Even if you live in a rural area people will go mad, damage crop cycles and simply take animals with no thought of replenishment unless you can get communities to work to defend it. One farmer alone, even if armed, won't stop it.

          Elon Musk is deluded if he thinks going to Mars will help us survive a future nuclear war. By the time you can support a sizable colony on Mars they will be involved too. But I'll concede it is a first step to humanity's resilience.

          So let's not let our stupid and ridiculous leaders, lead us into conflagrations. It's about time we stopped listening to the constant propaganda especially from the EU globalist bureaucrats who just want absolute control. The mad idiots. I'm all for strong defence and more military spending but I emphasise for DEFENCE.

          1. Anonymous Coward
            Anonymous Coward

            Re: What a great idea

            >So where does it leave us? If there is a world war millions may die, even without nukes, not from bullets but starvation if all the Internet links are severed. With EMP nukes it would be billions dying of starvation or fighting for the last crumbs with most electronics gone. Even if you live in a rural area people will go mad, damage crop cycles and simply take animals with no thought of replenishment unless you can get communities to work to defend it. One farmer alone, even if armed, won't stop it.

            It doesn't even need to be a nuclear war. A decently sized Carrington Event would take down much of civilisation and have a similar effect on the human population...

      3. HereIAmJH Silver badge

        Re: What a great idea

        The big outages on cloud services aren't the problem for customers. Amazon throws everyone they have at these kinds of problems and no one goes home until it's fixed.

        The problem is when it's a smaller outage and only affects some of their clients. Then you get regular staff working on it and at end of shift, if you are lucky, it gets handed off to someone else who has to get up to speed on the problem. If you were working this problem on site your company would be sparing no resources to get you back online. With a cloud provider the urgency is based on how big of a customer you are. And note, if you spread your business across multiple cloud providers you become a smaller customer even though you are buying the same, or more, resources.

        1. Anonymous Coward
          Anonymous Coward

          Re: What a great idea

          Yes. You pays yer money and makes yer choice.

      4. Anonymous Coward
        Anonymous Coward

        Re: What a great idea

        Problem is we have a habit of behaving like sheep and rushing headlong into the latest trend. Cloud is still considered the place to be, although well over the peak of hype. Combine that with a lack of professionalism in assessing architectures and environments, and decisions are made on emotion and convenience. The hyperscalers, whilst warning that shit happens, don't seem very upfront about their internal dependencies. If your service matters you need to pay more to avoid outages by taking more time to build it resilient.

        1. Yet Another Anonymous coward Silver badge

          Re: What a great idea

          Like the trend of having a computer on every desk that users are supposed to manage themselves rather than using a professional Data Service Bureau

      5. Anonymous Coward
        Anonymous Coward

        Re: What a great idea

        Sounds a bit anti-capitalist of you... Shouldn't the "market" be sorting this out? If companies lose trust in AWS with enough of these events, won't customers hop over to other platforms / providers?

        Obviously it was a financial motivation rather than technical, but you can look at VMware's market share predictions since the licensing changes.

      6. MyffyW Silver badge

        Re: What a great idea

        Exactly @abend0c4, the problem isn't using a cloud service, but rather a significant proportion of the planet using one region of one cloud service provider.

        Breakup would be consistent with previous monopolies (Standard Oil)

        One potential solution: the cloud service industry could be regulated for certain non-negotiables (like resilience).

        Another: the cloud service industry could attest independently to the level of resilience it has (giving the option of you contracting to a less costly service at the expense of flakiness)

        What we have today is the illusion of resilience, but the real-existing experience of it falling over.

    3. Wayland

      Re: What a great idea

      Two advantages, when it goes wrong it's not your fault and 2nd loads of other people are in the same boat.

  3. Caver_Dave Silver badge
    WTF?

    How on Earth did we allow HMRC, with so much personal data about almost everybody in the UK, to be hosted by what is commonly being described under the current leadership as a rouge state?

    1. elsergiovolador Silver badge

      They probably can't see you behind a mountain of brown envelopes.

      This needs an investigation. Sounds reckless at very least.

      1. Mike 137 Silver badge

        "a rouge state"

        If it's a rouge state they're probably pink envelopes.

        1. This post has been deleted by its author

      2. Anonymous Coward
        Anonymous Coward

        Downvotes by a civil servant perhaps?

        or a politician.

    2. Anonymous Coward
      Anonymous Coward

      Because every other supplier who answered the request for tender apart from AWS and Azure shat the bed. As one of those suppliers was Fujitsu with their cloud offering (since throttled in its cradle) you can guess how badly the bed linen was soiled. Some of the other suppliers weren't as good as Fujitsu.

      1. Anonymous Coward
        Anonymous Coward

        With something like HMRC, it ought to be run in-house for security and accountability. They are a big enough organisation that they can resource it at the appropriate scale - it's not like some small company using cloudy services because it allows far more functionality than they could provide in-house.

    3. Crypto Monad

      > what is commonly being described under the current leadership as a rouge state?

      I would say more orange than rouge.

    4. lordminty

      VME?

      I thought much of HMRC's legacy systems still run on what was ICL's VME?

      I also thought that was outsourced to Fujitsu, but I guess it's entirely possible some nutcase virtualised the Windows environment VME now runs on and stuffed it into AWS.

      1. omikl

        Re: VME?

        VME has been running hosted for over 25 years. Whether someone shoved it into AWS is another question.

        As I left Fujitsu in 2008: "Not my Circus. Not my Monkeys" :)

    5. Anonymous Coward
      Anonymous Coward

      We are led by selfish leaders from the top down. It has spread the culture to the institutions of government. Maybe this is all by intent to bring about the reset, but nonetheless the people are not being served by institutions that are supposed to serve them. We pay for HMRC as we pay for Parliament. They are supposed to be our employees but they have managed to reverse the intentions of democracy. Ask yourselves: when dealing with a government institution, do you feel like the boss or the bossed?

      "If" they truly served us we could get rid of 80% of the nonsense and there would be plenty of money to build professional and highly resilient systems for the remaining 20% of services that we actually need to thrive. Most of the stuff done is for control and politics. We do not need a complex tax system. We do not need digital ID or digital money; we already have it and it works. When they say digital they mean under their control. We are being treated as cattle; we need to be the farmers.

    6. Yet Another Anonymous coward Silver badge

      Data wasn't hosted on east-1 but a bunch of services that other services relied on were

  4. TRT Silver badge

    “Over time we reduced throttling of operations and worked in parallel to resolve network connectivity issues until the services fully recovered. By 3:01 PM, all AWS services returned to normal operations, meaning problems persisted for over a dozen hours after resolution of the DynamoDB debacle.”

    Is that East Coast time?

    1. ttlanhil

      The first timestamp is in PDT, so the 3:01 is as well (unless marked otherwise, which it isn't)

      Although it seems like much of the world operates (or not, recently) on US-EAST-1 time, so...
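The time-zone arithmetic in this sub-thread can be checked with a short sketch. It assumes fixed daylight-time offsets (PDT = UTC-7, EDT = UTC-4); the calendar date in the code is a placeholder that the thread does not state, and `to_zone` is a hypothetical helper name, not anything AWS publishes.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets for US daylight time; adjust if standard time applies.
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")
UTC = timezone.utc


def to_zone(moment: datetime, target: timezone) -> datetime:
    """Re-express a timezone-aware datetime in another zone."""
    return moment.astimezone(target)


# "By 3:01 PM" read as PDT. The date is an assumption, used only to build a datetime.
recovered = datetime(2025, 10, 20, 15, 1, tzinfo=PDT)

print(to_zone(recovered, EDT).strftime("%H:%M %Z"))  # 18:01 EDT
print(to_zone(recovered, UTC).strftime("%H:%M %Z"))  # 22:01 UTC
```

Under those assumptions, 3:01 PM Pacific is just after six in the evening on the US East Coast.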

      1. Claptrap314 Silver badge

        US-EAST-1 time is ZT/UT/GMT. They at least got that much right.

        1. This post has been deleted by its author

    2. disk iops

      Almost all support ops is run out of Seattle.

    3. WolfFan Silver badge

      Nope. We were still having problems until around 19:00 Eastern. We did not have full and complete access until 19:33 Eastern. Subtract three hours for West Coast time. Add five hours for GMT. Some of our systems were operational at 08:00 but failed by 11:00, some were dead at 08:00. So that's eight to eleven hours of no service, which could have been avoided if only our stuff was on prem.

      1. TRT Silver badge

        It's on somebody's prem.

  5. DJV Silver badge

    So, the whole of AWS is built like a line of dominoes...

    1. Anonymous Coward
      Anonymous Coward

      a maximum of two per team

    2. cd Silver badge

      Like the financial sector, it's built like a Jenga tower.

      Each block that comes out gets a bonus.

      1. dmesg Bronze badge

        Great image/explanation. Sums up so much. I will be using this in my conversations.

    3. Anonymous Coward
      Anonymous Coward

      Mmmm, not necessarily, but there are certainly more dominoes than admitted. But let's not forget things go wrong wherever they are hosted. The bad thing about hyperscalers is the cross-border dependencies, which are not clarified. Even if they were, not all would even be known. Ideally they would move towards full regional independence so the failure of any service elsewhere only affects its region unless the client has designed otherwise. But money matters.

    4. Cliffwilliams44 Silver badge

      The entire internet, intranet, your on-prem infrastructure is ALL built on a line of dominoes known as DNS! When DNS fails, everything fails!

      Dork up your DNS in Active Directory and watch your entire enterprise come to a halt!

      All of these services and their dependencies rely on functioning DNS to work because IP addresses change as instances are spun up, swapped over, etc.

      It kills me how you server jockeys think you can provide the complex interdependent infrastructure that AWS provides On-Prem.

  6. Anonymous Coward
    Anonymous Coward

    It just goes on and on...........

    Yup...... computers are fantastic, and productive, and useful....................

    ...............until they aren't!! Just ask:

    - Jaguar Land Rover

    - Asahi

    - Cloudflare

    - SolarWinds

    And now:

    - Alexa services

    - Ring doorbell services

    - Signal

    - WhatsApp

    - Lloyds Bank

    - HMRC

    I wonder if any of the accountancy types who promised that "the cloud" would be a fantastic replacement for "on premises" data centres are available for comment?

    ....or Gartner?

    1. IceC0ld

      Re: It just goes on and on...........

      [quote] I wonder if any of the accountancy types who promised that "the cloud" would be a fantastic replacement for "on premises" data centres are available for comment?

      ....or Gartner? [/quote]

      we can all wonder, but I will not be holding my breath :o)

    2. This post has been deleted by its author

    3. Anonymous Coward
      Anonymous Coward

      Re: It just goes on and on...........

      The problem is the massive dependencies that have arisen.

  7. DVG46

    Amazonian Engineers?

    “Amazonian engineers”: is it wrong that this conjures up visions of statuesque, single-breasted female technicians in white lab coats?

    1. Anonymous Coward
      Anonymous Coward

      Re: Amazonian Engineers?

      Single breasted? I was thinking of the last Wonder Woman actress when you said Amazonians. I'd better have another coffee and settle down.

  8. Mikel

    If you didn't test it...

    It's not a backup.

    That goes for operations as well as data.
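One way to act on "if you didn't test it, it's not a backup" is to restore the archive somewhere disposable and compare it with the live data. The sketch below is a minimal illustration using only the Python standard library; `sha256_of` and `restore_and_verify` are hypothetical names, and it assumes the backup is an archive that `shutil.unpack_archive` can read (for example a `.zip`).

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_and_verify(source_dir: Path, archive: Path) -> bool:
    """Restore `archive` into a scratch directory and check that every file
    under `source_dir` comes back with identical contents."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(str(archive), scratch)
        for original in source_dir.rglob("*"):
            if not original.is_file():
                continue
            restored = Path(scratch) / original.relative_to(source_dir)
            if not restored.is_file():
                return False  # file missing from the backup
            if sha256_of(restored) != sha256_of(original):
                return False  # contents differ from the live copy
    return True
```

A scheduled job could call `restore_and_verify` after each backup run and raise an alert when it returns `False`. It only proves the data restores; testing the operational side (failover, runbooks, people) still needs a real drill.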

    1. Anonymous Coward
      Anonymous Coward

      Re: If you didn't test it...

      After fifty years experience of making computers reliable and safe with well considered processes and procedures, we let cowboys riding their cloud take over.

      We're doomed !

  9. Anonymous Coward
    Anonymous Coward

    Just shows

    It just shows hyperscaler datacenters are not as isolated as they claim and people need to think carefully top down what is important before rushing headlong into using vendor specific services. They're great if a day's outage doesn't matter too much but I heard UK Gov's OneLogin was affected - doh! Don't you worry about Digital ID, it will keep you safe from living. The hyperscalers need to be VERY clear about what is reliant on global services and how so people can make sensible decisions. To not be transparent will limit their market.

    1. Anonymous Coward
      Anonymous Coward

      Re: Just shows

      All cloud hosted services should advertise who is hosting them. Responsibility lies with the purchaser of these services.

  10. xyz Silver badge

    The more I read about this the...

    more I think Jesus H Christ... spaghetti architecture.

    1. Anonymous Coward
      Anonymous Coward

      Re: The more I read about this the...

      It's a Gordian knot.

    2. Cliffwilliams44 Silver badge

      Re: The more I read about this the...

      Let me corrupt all the SRV records in your internal DNS and see how long your wonderful on-prem infrastructure stays working!

      The problem was not the cloud, it was not the architecture, it was a human fucking something up! Which can happen on-prem just as well as in the cloud!

      1. Excused Boots Silver badge

        Re: The more I read about this the...

        True, if I cock up a load of DNS service records, down goes my on-prem infrastructure.

        Except that’ll just be ‘my’ on-prem infrastructure, not yours, not Ring, not Snapchat, not Signal etc.

        And not a significant fucking chunk of the global infrastructure. That’s the difference, ‘embrace the cloud’ is pushed and marketed as ‘vastly more resilient and reliable than going it yourself’, which is probably true, although the more, err, delusional, advocates will claim it’s ‘perfect and never goes wrong, promises 100% reliability’. Which, of course is complete bullshit, but it impresses CEOs!

        Obviously the claim that everything is distributed and has no single point of failure is demonstrably wrong. Hypothetically, say North Virginia suffered an earthquake, volcanic eruption or meteor impact, and US-EAST-1 isn’t coming back in the near future, or it was a simple human error but fixing it took a lot longer than 12 hours, then what? Why do some UK government sites go down, even though they are hosted in local data centres, because of an issue in N Virginia?

        This shit has not actually been properly thought through, has it?

  11. Will Godfrey Silver badge
    Holmes

    If nobody completely comprehends it.

    It's either too big, too complicated, or both.

  12. jackofclubs

    what, me worry?

    Just an old gray tech pensioner here, waiting out my last days on the net and learning so much, as always, from all the Reg comments. I was a marketing sod who worked inside some Sacred Holy Engineering companies, now long gone or zombified beyond recognition, such as Honeywell Information Systems (did Engineering Design Reviews on Multics terminals), Digital Equipment (Ken Olsen was a hoot even if he pretended to be an Ogre), and LTX (once had Sol Max explain to me how to measure timing signals using a delay line). I took the trouble to get my CCNA because I eventually realized that the real Alchemy was done in the network, and learned to love IPv4 and doing subnets manually. So my question, dear brothers and sisters, is what happened to the OSI 7 layer model and having everyone talk together peacefully over this internet thing. We learned all about BGP and DNS etc etc and how fucked up it all is, but it more or less worked, right? Is this the lesson of AT&T, Bell Labs, and SS7 all over again? I can pick up a phone in Nutley NJ and call my cousin in Dublin, right? Ok, ok I know, it's all about scale, and only AWS can solve that. Or is this the pseudo-philosophical battle between SNA and DECnet Phase V? For the love of Cthulhu can anyone get us back to just talking to each other, with some packets?

    1. David Hicklin Silver badge

      Re: what, me worry?

      Sadly it has all been sacrificed on the Altar of Profits

  13. Bryan W

    The cycle continues

    Cloud only where it makes sense. That's about 5% of things. Ppl I work with are laughing all the way to the bank today and yesterday.

    Hey y'all!

    If you think "cloud" is just the bee's knees, then you're really gonna love you some of this "AI chatbot!"

    Buy buy buy! you mindless consumers.

  14. John 61
    Mushroom

    Marketing

    Let's have a global network of computers with access to all information ever created. Let's store personal details on these machines. Auntie Ethel who is in her 90s suddenly needs a degree in computer science to work these things, as does 7 year old Herbert. Let's have VPNs where you can appear to be in another country. You can view stuff that's neither intended for nor relevant to you! Online banking/bill paying! Entertainment and sports of your choice anywhere in the world, available 24/7! Binge watch TV! Remote access to any of these machines! Make your home smart, for ultimate convenience! Send messages to family and friends wherever they are in the world! Blue light keeping you awake when you should be sleeping! Simple password access! Share pictures of your dinner with the world! Be liked by everyone! No bad stuff! Great ideas with the best of intentions! What could possibly go wrong?

  15. BartyFartsLast Silver badge

    Obligatory

    So, how's that cloud thing working out for you?

  16. Bucoops

    Avoiding "cloud" as long as possible.

    We are under pressure from our owning company to get moved off prem and onto cloud based asap. I'm dragging my feet somewhat.

    We did recently move from Sage on prem to Xero. Guess which cloud provider hosts Xero....

    Our MSO that we use for 3rd line support were dead in the water as their monitoring, ticketing and VoIP are all hosted on AWS. Some of our sister companies have custom apps, many of which it seems run on AWS.

    We had an urgent heads of IT meeting, where there was a lot of flapping going on. I just stayed muted and carried on working as normal. The overall head of IT at the owning company did ask if I was feeling smug. I just said that accounts are annoyed as they can't access Xero, but apart from that, no, not smug, as pride comes before a fall.

    I just reminded him that there are a few good reasons we haven't moved from on prem and aren't rushing to do it now. It was actually nice to have a quiet day without being bombarded with data export requests etc.
