AWS power failure in US-EAST-1 region killed some hardware and instances

A small group of sysadmins have a disaster recovery job on their hands, on top of Log4J fun, thanks to a power outage at Amazon Web Services’ USE1-AZ4 Availability Zone in the US-EAST-1 Region. The lack of fun kicked off at 04:35AM Pacific Time (PST – aka 12:35 UTC) on December 22nd, when AWS noticed launch failures and …

  1. Anonymous Coward
    Windows

    Elastic

    Elastic Compute Cloud sounds great. If you ignore the fact that elastic under stress tends to snap. And AWS's elastic seems to do it on a regular basis.

    While those affected have to make do with their lump of coal, rest assured that the execs are all nestled in bed while visions of sugarplums dance in their heads.

    1. Version 1.0 Silver badge
      Alert

      Re: Elastic

      It's the winter and we're having some cloudy weather ... down here on the Gulf coast this sort of bad luck doesn't seem too bad ... it just illustrates the need for backups and alternative servers when you are just hoping the clouds will not have a problem ... like the wind speed getting up to 150mph ... that can cause a cloudy fuse to blow.

    2. Snake Silver badge

      Re: Elastic

      "26 minutes later the cloud colossus ‘fessed up to a power outage and recommended moving workloads to other parts of its cloud that were still receiving electricity."

      So. What's the use of calling the system "elastic" and "cloud based" IF THEY MAKE YOU MANUALLY RECONFIGURE DURING AN OUTAGE??!

      If "cloud" were truly this magic solution to flexible workload problems, wouldn't it auto-reconfigure to a different part of their network when systems go down?

      Amazon's statement is to switch servers, and you can do that yourself on your own hardware. So why pay to use someone else's hardware, from a provider that markets redundancy and availability, if they can't even be responsible for providing that without forcing you, the customer, to handle it yourself??

      1. Dwarf

        Re: Elastic

        All the tools are there to allow people to configure for full redundancy or for minimal cost and everything in between; it's down to the customer to choose what's right for their workload, balancing the cost and complexity against the risk and impact of their hosted services being down.

        Those that have opted to configure using the standard redundant patterns will be OK as it will do the automatic failover as designed.

        For those that went the minimal configuration and cost route, they accepted the impacts on resilience for their decision.

        It's hardly fair to try and blame the vendor when it's the customer that chooses how to consume the service for their specific workload. The patterns for full resilience (and highest cost) are well publicised, as are the patterns for cold-standby backups and CloudFormation to rebuild the same stack in other regions or zones. Read up on the AWS Well-Architected Framework.

        There is no need for full redundancy in one physical location, and in fact it's never possible to be fully redundant in one place, for a variety of reasons including forces of nature, even if you could fix all the technical issues around resilience.
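
        For what it's worth, a minimal sketch (Python/boto3) of the kind of standard redundant pattern meant here: an Auto Scaling group spread across subnets in different Availability Zones, so a single-AZ power loss just means replacement instances get launched elsewhere. The launch template name and subnet IDs are hypothetical placeholders, not anything from the article.

        # Minimal multi-AZ sketch: one subnet per Availability Zone, so the ASG
        # launches replacement instances in surviving AZs if one zone loses power.
        # The launch template and subnet IDs are hypothetical placeholders.
        import boto3

        autoscaling = boto3.client("autoscaling", region_name="us-east-1")

        autoscaling.create_auto_scaling_group(
            AutoScalingGroupName="web-tier",                       # hypothetical name
            LaunchTemplate={"LaunchTemplateName": "web-template",  # assumed to exist
                            "Version": "$Latest"},
            MinSize=2,
            MaxSize=6,
            VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
            HealthCheckType="EC2",
            HealthCheckGracePeriod=120,
        )

        The same template-driven stack can then be rebuilt in another region with CloudFormation if a whole region is in trouble, which is the cold-standby pattern mentioned above.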

        1. unimaginative

          Re: Elastic

          You are right, but a lot of people do not seem to realise it.

          There is an assumption that "cloud" means someone else is taking care of everything

          1. WolfFan Silver badge

            Re: Elastic

            There’s a word describing people who make that kind of assumption. That word is ‘idiot’.

            1. John Brown (no body) Silver badge

              Re: Elastic

              Other words include naive, gullible and similar, because of the marketing headlines. A lot of the affected people will be small businesses with little to no idea about setting up IT systems, who went with AWS and/or other cloud providers because of the advertising.

              1. Martin M

                Re: Elastic

                In other words, probably the exact same people who would have screwed up on-prem.

                1. cyberdemon Silver badge
                  Mushroom

                  Re: Elastic

                  You can't "blame the customer for being stupid enough to believe our marketing drivel / lies" - The reason that there is a chronic lack of IT expertise / clue amongst small business owners is BECAUSE c*ts like Bezos' Bozos are pretending to have a magic wand that fixes all IT woes, when they don't. So nobody bothers to get proper IT training - the IT training colleges just tell them how to set up a bloody EC2 instance!

                  1. Martin M

                    Re: Elastic

                    Small business owners have lacked IT expertise/clue since the dawn of computing.

                    And yes, I do blame them if they're so massively naive as to unquestioningly believe marketing. Most people wise up to that when they're about 5 years old, put a toy on their Christmas list off the back of exciting puffery, and receive underwhelming plastic tat.

                    Luckily, nowadays most sensible small businesses don't try to train their admin assistant to juggle EC2 instances, but instead go for a collection of SaaS. Many of those are horrible, and it is a lemon market, yet it's still almost always better than them trying to muddle through themselves.

        2. Jaywalk

          Re: Elastic

          You are right! There are enough configurations, tools, and capacity to provide full redundancy. But if you follow that religiously, the cost advantage of cloud is gone. Vendors sell cloud as a cheap yet reliable alternative, but in reality making it reliable is not cheap for many companies. A lot of customers are frustrated with the hidden costs. Cloud bill shock is not news anymore.

      2. DougMac

        Re: Elastic

        It is elastic, if you spend a lot more money and engineering time to make it redundant yourself, buying additional and redundant services to make it so.

        They assume everyone is a dev who can rewrite all their apps to fit within their model. I.e. if you don't have multiple AZs, load balancing, EBS, backup EBS, etc. etc. etc., you are doing AWS wrong.

        Whereas the rest of the real world expects to treat AWS objects as a server in the cloud that runs as well as other setups.

        There is a giant disconnect between doing AWS right, and what the rest of the world expects.

        All these consultants coming into businesses and selling the cloud have no clue either.

      3. John Brown (no body) Silver badge

        Re: Elastic

        "If "cloud" were truly this magic solution to flexible workload problems, wouldn't it auto-reconfigure to a different part of their network when systems go down?"

        That was my takeaway from this too, where AWS are quoted as suggesting people buy more than one instance/location so as not to be affected by these outages. And yet all the AWS and other cloud providers' marketing, in big headline letters, tells you how resilient they are. I suppose in the 3pt grey on grey small print they then contradict the strong implications of the headlines.

        1. Anonymous Coward
          Anonymous Coward

          Re: Elastic

          "3pt grey on grey small print"

          Bravo comment.

      4. Martin M

        Re: Elastic

        They don’t. Most Platform as a Service products automatically fail over in the event of an AZ outage.

        If you’re using EC2 then you have to engineer your own solutions, but APIs and tooling allow you to automate almost anything. If you have to do anything manual in order to fail over - other than possibly invoke a script - you’re doing it very wrong.
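
        For illustration, a rough sketch (Python/boto3, hypothetical IDs and names, not anything AWS published) of the sort of script-driven failover meant here: detect an impaired instance, launch a replacement from a launch template into a subnet in another AZ, and move the Elastic IP across.

        # Rough sketch of scripted EC2 failover across AZs. All IDs and the launch
        # template name are hypothetical; a real setup would usually let an ASG or
        # a PaaS product handle this automatically.
        import boto3

        ec2 = boto3.client("ec2", region_name="us-east-1")

        PRIMARY_INSTANCE = "i-0123456789abcdef0"   # hypothetical instance ID
        STANDBY_SUBNET = "subnet-bbb222"           # subnet in a different AZ
        EIP_ALLOCATION = "eipalloc-0aaa111"        # Elastic IP fronting the service

        statuses = ec2.describe_instance_status(
            InstanceIds=[PRIMARY_INSTANCE], IncludeAllInstances=True
        )["InstanceStatuses"]
        healthy = bool(statuses) and statuses[0]["InstanceStatus"]["Status"] == "ok"

        if not healthy:
            # Launch a replacement in another AZ from a pre-built launch template.
            replacement = ec2.run_instances(
                LaunchTemplate={"LaunchTemplateName": "web-template"},  # assumed to exist
                SubnetId=STANDBY_SUBNET,
                MinCount=1,
                MaxCount=1,
            )["Instances"][0]["InstanceId"]
            ec2.get_waiter("instance_running").wait(InstanceIds=[replacement])

            # Re-point the public address at the replacement instance.
            ec2.associate_address(AllocationId=EIP_ALLOCATION, InstanceId=replacement)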

  2. cjcox

    New AWS service working as expected

    Amazon recently brought up their Elastic Total Failure Service and has reported that so far, it has been quite reliable. Right now the service is free for all AWS customers. /s

    "I just checked. We're down." - Joe Satisified, Important Company, Inc.

  3. Nate Amsden

    9.5 hrs of downtime

    For Chef's software stack (https://status.chef.io/), my alerts dashboard hadn't been that red in years. Fortunately I didn't have to make any Chef changes today. Obviously they didn't have a disaster recovery plan (nor do most companies); they waited for Amazon to fix their stuff and then tried to recover what they could (at least that is what it seems like from the outside anyway).

    The list of companies affected by these outages (Amazon included) just shows that building apps resilient to such cloud failures is beyond the reach of most organizations (whether it is complexity or cost or both). I've been saying that for just over eleven years now, and I'm not surprised the trend continues. I moved my org out of Amazon's cloud in 2012 and have been running trouble-free ever since, with literally $10-15M+ in savings since (it would have been nice to get more of that savings invested into more infrastructure, but the company was stingy on everything). There was no lift and shift into the cloud; the company was "born" in the cloud (before I even started). But still many people just don't get it (that cloud is almost always massively more expensive than hosting yourself, unless you are doing a really bad job of hosting it yourself, which is certainly possible, though it's much more common to host in cloud very poorly than to host yourself poorly). I don't get how you couldn't get it at this point.

    1. unimaginative
      Unhappy

      Re: 9.5 hrs of downtime

      If at some point your in-house systems fail, you are probably going to be blamed for moving it out of AWS.

      On the other hand all AWS failures can be blamed on Amazon.

      It's personally much safer to be in the latter position. CYA.

      1. spireite Silver badge

        Re: 9.5 hrs of downtime

        That's bloody true..... damned if you do, damned if you don't

        It usually follows this plan....

        "We need to move to cloud for resiliency"

        "We don't need failover"

        "Our cloud bill is massive, move it back"

        followed by

        "System crashed"

        "Why did you move it to datacenter" (despite the fact you were told to move it)

        and despite the fact that a single node in the cloud is no more resilient than a single node in a traditional DC

      2. Nate Amsden

        Re: 9.5 hrs of downtime

        It's been 10 years and that hasn't happened yet. Well, systems fail and VMs are automatically moved over (often before the alerts even have a chance to get out). In fact the org has cut BACK on providing protection: we used to have products that provided automatic DB failover (ScaleArc), but they decided to stop spending on those, I suppose in part because we haven't had a primary DB fail in maybe 6 years now? I'm not sure. I remember one time a developer building something into their app, and I asked why they did it that way; they said in case the VM fails and we have to rebuild, or something like that. I laughed and said that doesn't happen, not since we moved to our own little data center. It was a regular occurrence in cloud though.

        The apps certainly have failed from time to time, and there have been outages, especially when the main app(s) enter feedback failure loops because of a lack of really good testing. Those situations are pretty rare though.

        But it is true that the org has come to expect super high uptime because that is what I've provided, so when something does go wrong, which is super rare, sometimes they do complain.

        There was one time in late 2012 when we were just slammed with traffic (within 1 year of moving out of cloud) and we didn't have enough capacity on our app servers. One of the lead developers said, fuck it, take QA down, give prod the capacity (that guy was really cool, I miss him). I said, wow, ok, I can do that in a few minutes if you really want. He said do it, and the CTO said go for it too. So I did. Powered off all of QA within 5 mins and gave the capacity to prod. Ran like that for a few weeks; they didn't care. Only had to do that once, we bought more capacity after that. There were other times when the app ran out of capacity, but that was an app limit; there was no way adding infrastructure would have helped it (they eventually discovered and fixed those bottlenecks, or at least most of them, then the org scrapped that app stack (and the devs that built it left) and built a new one (with a new team) with even more issues).

        The apps for years had many single points of failure as well, even when in cloud, such as a single memcache server that was super critical. Later the new app used memcache too; even though we asked them not to, they went ahead anyway. You know how I said there were more issues on the new stack? They later told us that the new stack hosted stateful data in memcache, and they asked us NOT to reboot those VMs for things like security updates as a result. They could recover the data but it would be an app outage during the process. They eventually moved to Redis in an HA configuration, but it took years. I had our memcache servers running under VMware Fault Tolerance, though a host failure never took one down, so FT never had to kick in.

        Big failures are rare on prem or in cloud. What on prem can help most with, in my opinion, is small failures, which happen in cloud on a far too regular basis. Execs don't care much about those because they aren't dealing with them; whoever is managing the servers/storage is (which in my org would be my team), and that just means tons of headaches.

        Two of my switches are still around and in production from their original deployment. I checked again yesterday: they were first powered on Dec 20 2011, and currently one has 3,655 days of service and the other 3,654. Everything else from that era was retired in 2019 or before.

        I can recall just 3 VM host failures over the past ~18 months. All 3 were the same host. I believe it is a hardware issue (DL380 Gen9) but there is not enough evidence to determine which component(s) to replace. The system behaves as if both of its power feeds are cut at the same time, which is not happening (unless it's inside the chassis). The system ran fine for 3+ years before this behavior started. In the meantime I just use VMware DRS to prevent critical systems from running on the host until I have more data, so it's basically a non-event when it fails.

        1. Crypto Monad Silver badge

          Re: 9.5 hrs of downtime

          Are you in a single data centre? Make sure you have a good DR plan.

          Google and you'll find plenty of incidents of power and cooling outages lasting many hours - or worse, like the fire at OVH.

          1. Nate Amsden

            Re: 9.5 hrs of downtime

            yes, single data center. No DR plan. I've never worked at a company in my career (24 years) that had a workable DR plan, even after near disasters. Everyone loves to talk DR until costs come up, then generally they don't care anymore. At my current company this happened every year for at least 5-6 years, then people stopped asking.

            The closest one to having a DR plan actually invested in a solution on paper, due to contract requirements from the customers. However they KNEW FROM DAY ONE THAT PLAN WOULD NEVER WORK. The plan called for paying a service provider (this was back in 2005) literally to drive big rig trucks filled with servers to our "DR site" and connect them to the network in the event of a DR event. They knew it wouldn't work because the operator of the "DR site" said there's no fuckin way in hell we're letting you pull trucks up to our facility and hook them up (they knew this before they signed on with the DR plan). They paid the service provider I think $10k/mo as a holding fee for their service.

            That same company later deployed multiple active-active data centers (they had to be within ~15 miles or something to be within latency limits) with fancy clustering and stuff to protect with DR, years after I left. One of my team mates reached out to me joking that they were in the midst of a ~10 hour outage on their new high availability system (both sides were down; not sure what the issue was, I assume it was software related, like Oracle DB clustering gone bad or something).

            At another company I was working on a DR plan at the time; it was not budgeted for correctly, and I spent months working on it. While this was happening we had a critical storage failure that took the backend of production out for several days. There was no backup array, just the primary. It was an interesting experience and I pulled so many monkeys out of my ass to get the system working again (the vendor repaired the system quickly but there was data corruption). Most customers never saw impact as they only touched the front end. I got the budget I was fighting for in the end, only to have it taken away weeks later for another pet project of the VP that was also massively underfunded. I left soon after.

            My current company had another storage failure on an end-of-life storage system, and guess what, the IT team had NO BACKUPS. Accounting data going back a decade was at risk. The storage array would not come up. I pulled an even bigger monkey out of my ass getting that system operational again (took 3 days). You'd think they would invest in DR or at least a backup array? I think so. But they didn't agree. No budget granted.

            Rewind to 2007ish, hosting at the only data center I've ever visited that suffered a complete power outage (Fisher Plaza in Seattle). I was new to the company and new to that facility. It had previously experienced a power outage or two for various reasons; one was a customer hitting the EPO button just to see what it would do (the aftermath was that all new customers required EPO training). Anyway, I didn't like that facility and wanted to move to another one but was having trouble getting approvals. Then they had another power outage and I got approvals to move fast. I remember the VP of engineering telling me he wanted out and didn't care what the cost was, and I was literally at the end of the proposal process and had quotes ready to go. We moved out within a month or two. That same facility suffered a ~40 hour outage a couple of years later due to a fire in the power room. The building ran on generator trucks for months while they repaired it. It was news at the time; even "Bing Travel" was down for that time, they had no backup site. Several payment processors were down too, at least for a while.

            I read a story years ago about a fire in a power room at a Terremark facility. Zero impact to customers.

            Properly designed/managed datacenters almost never go down. There are poorly managed and poorly designed facilities. I host some of my own personal equipment in one such facility that has had several full power outages over the past few years (taking the websites and phone systems of the operator out at the same time); as far as I know there is no redundant power in the facility, which was designed in the 90s perhaps. Though it is cheap and generally they do a good job. I wouldn't host my company's equipment there (unless it was something like edge computing with redundant sites) but for personal stuff it's good enough. Though it's sad that there are fewer power outages at my home than at that data center.

            Amazon and other hyperscalers generally build their datacenters so they CAN GO DOWN. This is mostly a cost exercise, doubling or tripling up on redundancy is expensive. Many customers don't understand or realize this. Some do and distribute their apps/data accordingly.

            As someone who has been doing this stuff for 20+ years, I believe people put too much emphasis on DR. It's an easy word to toss around; it's not an easy or cheap process. DR for a "data center" makes more sense if you are operating your own small-scale "server room/datacenter" on site, for example. But if your equipment is in a proper data center (my current company's gear is in a facility with ~500k sq feet of raised floor) with N+1 power/cooling, ideally operated by someone who has experience (the major players all seem to be pretty good), the likelihood of the FACILITY failing for an extended period of time is tiny.

            To me, a DR plan is for a true disaster. That means your systems are down and most likely never coming back. Equipment destroyed, or someone hacks in and deletes all your data. Outages such as power outages or other temporary events do not constitute a need to have or activate a DR plan. But it really depends on the org, what they are trying to protect and how they want it protected. 99%+ of "disasters" can be avoided with proper N+1 on everything. You don't need remote sites, the complexity involved in failing over, or an app designed to be multi data center/region from the start, as the costs of doing that are generally quite huge for situations that almost never happen.

            I've been involved with 3 different primary storage array failures over the past 19 years (all were multi-day outages in the end), and having an on-site backup storage array with copies of the data replicated or otherwise copied would address the vast majority of risk when it comes to disasters. But few invest even to that level. I've only worked at one company that did, and they didn't do it until years after they had a multi-day outage on their primary storage array. I remember that incident pretty well; I sent out the emergency page to everyone in the group on a Sunday afternoon. The Oracle DBA said he almost got into a car accident reading it. Seeing "I/O error" when running "df" on the primary Oracle servers was pretty scary. That company did actually have backups, but due to budgeting they were forced to invalidate their backups nightly because they used them for reporting. So you couldn't copy the Oracle data files back, at least not easily; I don't recall what process they used to recover other than just deleting the corrupted data as they came across it (and we got ORA crash errors for at least 1-2 years after, though they only impacted the given query, not the whole instance).

            1. Nate Amsden

              Re: 9.5 hrs of downtime

              Also wanted to add, what's worse than not having backups? I wasn't sure until I learned first hand. What's worse than not having backups is NOT TELLING ANYONE YOU HAVE NO BACKUPS.

              When the most recent storage array went down several years ago, everyone outside of the small IT team believed everything was backed up. It wasn't until the array did not come back up that IT management raised the issue: oh hey, we don't have ANY backups of the accounting system going back TEN YEARS. I wouldn't have been upset if this was well known. But it was not. I wasn't involved in IT at the time, and well, that IT director is long gone (not for any reasons related to the incident at hand). The IT team was doing backups, just none of that system, because of technical limitations that again weren't communicated widely enough. We resolved those technical limitations later.

              The array in question was decommissioned in 2019 (I think ~3 years past EOL date).

              1. spireite Silver badge

                Re: 9.5 hrs of downtime

                We all get that. I'm always astonished at how often the backup is down the back of a sofa and can't be found.....

                What should be done, but never is, is a quarterly, and certainly no less than biannual, check of backup processes on ALL production systems.

            2. spireite Silver badge
              Holmes

              Re: 9.5 hrs of downtime

              Good old DR planning,

              Back in the early 2000s, one of my old employers had such a thing; it lasted about 4 years. All server kit WAS in the office.

              In late Feb/early March every year, we'd troop off to a facility with backups retrieved from an offsite storage company (IronMountain).

              Our kit at the time was big black cubed IBM NetFinitys.

              What was priority? Email.... we were essentially a data company, but email was priority (Lotus notes).

              Other systems were SQL databases of a couple of vendors, SSRS.

              Of course these were bare metal restores, and the bare metal was something like HP DLxxx, or Dell 15xx. Couldn't find NetFinitys for love nor money.

              We'd get ourselves to business-critical systems up and running in three days.

              In fact it proved a few points.....

              1. Recovery of email is NOT critical

              2. Stick to hardware that is more commonly available - not NetIndefinitely

              3. No matter how you document, or how thorough you are. It will go wrong

              4. Ensure your dev work is properly source controlled. Ours was in VSS. That s**t isn't stable on a production machine; you're doomed on a bare metal restore, frankly.

              5. Upper management cannot prioritise for toffee.

              6. Testing becomes "Can I connect, can I select a row etc".. Full test suite?? Yeah right... "We can't afford the man time for that, they have jobs to do in HO"

              and

              7. In a DR centre no one hears you scream. (good, they'd think a zombie horde was being fought)

              My takeaway ? On paper a good thing, in reality barely above pointless.

    2. portege

      Re: 9.5 hrs of downtime

      But do you remember the Facebook outage that brought its three main services down? Now they have just announced a long-term commitment with AWS.

      I like the idea of having full ownership of my own infrastructure, but using cloud means I share my responsibilities, resources and workload with someone else (the cloud vendor), which makes sense even if there's an additional cost for that.

  4. DS999 Silver badge

    Cloud outages are becoming a regular occurrence

    I wonder how many can occur before businesses start to rethink depending so much on the cloud?

    1. Anonymous Coward
      Anonymous Coward

      Re: Cloud outages are becoming a regular occurrence

      Maybe it's time for AWS customers to be on more than one cloud provider, or even to shift away from AWS, given the number of outages over the past few weeks.

  5. StewartWhite
    FAIL

    Ever heard of a UPS?

    "As is often the case with a loss of power, there may be some hardware that is not recoverable"

    What a bunch of clowns! Amazon's revenues in 2020 were north of $385 billion, but it can't be bothered to pay for and install a UPS or two, and then when there's a power borkage restore from DR systems/backups maybe? What exactly are people paying stupid amounts for when hosting in AWS?

    1. Anonymous Coward
      Anonymous Coward

      Re: Ever heard of a UPS?

      > What exactly are people paying stupid amounts for when hosting in AWS?

      Originally it was for scaling on demand but at some point that morphed into general hosting and the benefits became a lot less clear.

      1. Lon24

        Re: Ever heard of a UPS?

        Yep I remember when Rackshack lost grid power for five days when a substation exploded. Yet not one server blinked.

        The UPS wasn't perfect. The switchboard wasn't covered, and there was no mobile phone coverage, so calling in help was challenging. Also, the standby generators had been tested individually but not all together, so the cooling system couldn't cope, leading to some ad-hoc re-engineering involving lots of hastily built plywood vents.

        But then we were paying $99/month and it was twenty years ago so things must be better nowadays ..

    2. Duncan Macdonald
      Mushroom

      Re: Ever heard of a UPS?

      The problem was probably WITH a UPS. Amazon is not stupid and will have UPS coverage for its cloud systems. Unfortunately UPSs are not 100% reliable, and when a large one fails it is sometimes in a dramatic manner. (Many years ago the UPS that fed the mainframe at the company I worked at failed - it melted a solid aluminium busbar in the process.)

      If a big UPS fails then it would take some time for the engineers to manually reroute the power connections. (Switching is easy - if any cables need to be reconnected this takes much longer - cables carrying multiple megawatts are not lightweight or very flexible.)

      Icon for some dramatic UPS failures ==========>

      1. seven of five

        Re: Ever heard of a UPS?

        >The problem was probably WITH a UPS

        Think so, too. Probably did a QVH. Or close to it.

        1. cyberdemon Silver badge
          Paris Hilton

          Probably did a QVH.

          What's a QVH?

          Inverter high-side power transistor fails short?

      2. StewartWhite

        Re: Ever heard of a UPS?

        How do you know that the problem was probably with a UPS? Nobody's provided that info as yet. I know all too well that UPS systems themselves can fail but you also have surge protectors for the circuit itself if you're hosting important "stuff" but you might be surprised at the number of cloud hosting companies who don't bother with such fripperies as DR - witness the SSP Pure insurance broking platform debacle in 2016 https://www.insurancebusinessmag.com/uk/news/breaking-news/the-ssp-outage-two-weeks-on-37592.aspx

        More importantly, IF the UPS has failed, where are the backups/what are the DR plans if the impact is this bad? These are absolute basics that cloud vendors need to be performing as a matter of course rather than getting more people signed up to their new "shiny" just so Jeff can fly to the moon.

        1. skswales

          Re: Ever heard of a UPS?

          Wrong. Where are *your* backups? What are *your* DR plans?!

        2. Henry Wertz 1 Gold badge

          Re: Ever heard of a UPS?

          "More importantly, IF the UPS has failed, where are the backups/what are the DR plans if the impact is this bad? These are absolute basics that cloud vendors need to be performing as a matter of course rather than getting more people signed up to their new "shiny" just so Jeff can fly to the moon."

          Well, that's the problem. This cloud stuff is marketed as being all redundant and failsafe and so on, but if you're paying for 1 set of compute capacity and 1 set of storage, you're not really getting any redundancy. If you buy compute and storage capacity in 2 redundancy zones, you are paying 2x the amount, plus of course you have to arrange whatever is needed to keep everything synced between those zones (which could be pretty easy if some "eventual consistency" and a small amount of data loss in the face of sudden failure is fine; too difficult for order handling and billing, where you really need perfect replication while keeping performance reasonable). So it's quite common for places to not buy this redundancy, both due to the price and due to the complexity of having the replication and failover actually working to take advantage of it.

          1. runt row raggy

            Re: Ever heard of a UPS?

            i've seen this claim a few times, and i find it confusing given that my understanding is there is generally an sla for these things. for example for ec2 instances https://aws.amazon.com/compute/sla/. can you point me to the aws marketing that says that an ec2 instance is failsafe?

            also given that most storage is ebs, and ebs has the ability to snapshot to s3, it's not correct to say there's no storage redundancy, or that it costs 2x to ensure additional redundancy. again, aws publishes sla ranges https://aws.amazon.com/ebs/features/.
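
            for illustration, a minimal sketch (python/boto3) of that snapshot-level redundancy: snapshot an ebs volume, then copy the snapshot to a second region. the volume id, description and regions are hypothetical placeholders, not anything from aws docs.

            # minimal sketch: snapshot an EBS volume, then copy the snapshot to
            # another region. IDs, description and regions are hypothetical.
            import boto3

            ec2_east = boto3.client("ec2", region_name="us-east-1")
            ec2_west = boto3.client("ec2", region_name="us-west-2")

            snap = ec2_east.create_snapshot(
                VolumeId="vol-0123456789abcdef0",    # hypothetical volume
                Description="nightly snapshot",
            )
            ec2_east.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

            # cross-region copy, so a regional problem doesn't take the backup with it
            ec2_west.copy_snapshot(
                SourceRegion="us-east-1",
                SourceSnapshotId=snap["SnapshotId"],
                Description="cross-region copy of nightly snapshot",
            )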

            i'll ignore the active-active redundancy scheme as a strawman.

        3. Marty McFly Silver badge
          Thumb Up

          Re: Ever heard of a UPS?

          >"I know all too well that UPS systems themselves can fail...."

          Multiple power supplies in the servers. Each fed with a separate UPS. Each UPS fed by a separate power feed. UPS redundancy solved.

          1. Anonymous Coward
            Anonymous Coward

            Re: Ever heard of a UPS?

            Catastrophic failure of one UPS (say a battery explodes and starts a big fire) takes the others with it...probably the actual servers, too, while it's at it. Some problems simply cannot be solved with that level of redundancy.

      3. Nate Amsden

        Re: Ever heard of a UPS?

        UPSs are not 100% reliable, that is true, which is why for best availability you design for N+1 at the datacenter level. Amazon would rather build redundant data centers than redundant UPSs/cooling systems. Which at some scale makes sense, but again app designers need to build and test for that, and most either do not do it at all or don't do a good enough job (as evidenced by the long list of companies impacted by such outages when they happen). The OVH situation was even worse: very poor/cheap design. Which again can make sense at super scale; shipping containers were (maybe still are) popular for server deployments at one point, but that's really a specialized situation not fit for general purpose computing.

        I remember several years ago at the data center we use, they sent out an email saying they had lost power redundancy; maybe one of the grid links failed, I don't remember. My then director was freaking out, wanting to send notification of this situation to others in the company. I don't recall if he did or not, but I just told him to calm the hell down. You have a right to freak out if one of our actual power feeds drops (everything was on redundant power). But that never happened. They repaired whatever the issue was and things were fine; we never lost power at any PDU.

        We had another data center in Europe which was more poorly managed. They did have a time when they actually had to shut down half of their power feeds to do something, then a week or so later shut the other half down. That was annoying. I didn't have to do anything, though our Fibre Channel switches were single-PSU, so one of them went down for each part of the outage, but it didn't impact anything. I was so glad to get out of that Telecity data center; I really hated the staff there. They didn't like me either. Telecity was much later acquired by Equinix. We had so many issues with the staff and their policies. In the end it was me who caused them to change their policy regarding network maintenance. Until I complained loudly enough, they felt free to take network outages whenever they wanted without notifying customers. I just didn't believe anyone would do that, but they were doing it, at one point taking half their network down without telling anyone. They fixed that policy anyway, but had tons of other issues. We moved out in 2018 I think.

        1. cyberdemon Silver badge
          Devil

          Re: Ever heard of a UPS?

          I think UPS failover (or supply failover in general) is a good case for DC power distribution instead of AC tbh. I often worry about the stability of our AC electricity grids.

          I don't know if Amazon had DC or AC distribution in this case, but I'd place a bet that it was standard 50/60Hz AC.

          The big problem with AC is that your sources all need to be in phase, so you can't easily have multiple UPSs holding up the same busbar. Normally you end up with one big inverter somewhere, which is a single point of failure.

          You can still have smaller UPSs on individual servers though, and you can have redundant PSUs inside the servers themselves which use diodes as a failover mechanism. But you can only do that after it has been rectified to DC.

          Any 'mission-critical' server like a disk array would "shurley" have double, if not triple redundant PSUs with separate AC busbars and separate UPSs though, so I'd be surprised if this outage was just a UPS failure. If so then some IT/electrical engineering heads could roll at Amazon.

      4. Mike 125

        Re: Ever heard of a UPS?

        >Amazon is not stupid

        On that, we agree. It's a company. It can't be smart or stupid.

        >If a big UPS fails then it would take some time for the engineers to manually reroute the power connections.

        That's weird- on Star Trek it happens instantly- well, usually within 5 mins or so of the end of the show.

        Hey, check this for a plan: how about Amazon configure... wait for it.... >1 UPS? Do you think that would help?

        1. Anonymous Coward
          Anonymous Coward

          Re: Ever heard of a UPS?

          Not in a UPS large enough to cover a whole massive data center. You're talking disaster levels of failure which border on Acts of God, at which point no on-site redundancy will be sufficient (as the disaster event can potentially destroy the site itself, including all redundancies within). Thus Amazon prefers multiple siting: the only practical defense against Acts of God.

        2. spireite Silver badge
          Coat

          Re: Ever heard of a UPS?

          Amazon should be able to fix it quickly. They also have Data

          1. cyberdemon Silver badge
            Facepalm

            Re: Ever heard of a UPS?

            see icon. don't come back.

        3. Conor Stewart

          Re: Ever heard of a UPS?

          There is a good chance that Amazon does have more than 1 UPS in that datacentre; they probably also have it segmented up so that different UPSs power different sectors. That might explain how some hardware was damaged but a lot of it was fine. It could also be that they had issues with power for a few days and by that time their UPSs were empty; none of us will know unless Amazon tells us. It could also be that one UPS failed in an unexpected way and ended up damaging their substation or triggering a failure or emergency shutoff in their other UPSs. Lots of things could go wrong, but none of us know.

          1. -v(o.o)v-

            Re: Ever heard of a UPS?

            Lots of different things can go wrong with a UPS and often do, based on my experience with some medium-sized Galaxy models. Contrary to the old APC models, which are trouble-free workhorses.

            The worst part is that the single countrywide Schneider distributor is a total sham, with unbelievably inflated prices, laughable response/delivery times, and bad service in general. Some behavior is borderline fraud: jacking parts prices up by offering parts for a much larger UPS, or claiming that products are discontinued so they cannot do anything for them, so why don't you buy these new ones...

            I hate UPS.

            1. -v(o.o)v-

              Re: Ever heard of a UPS?

              We've actually had one of 2 critical UPSs down for way too long now in one DC. It took Schneider Electric *two weeks* to come on site to even look at it. It has now been a *further week* since then, with no information about the results of that site visit.

              All they say is that they will submit a report later at some unspecified date.

              I would imagine SE is not this crap in every country. But they should seriously keep their country branches on a much tighter leash; this is ridiculous.

              I wish there was a way to let the mothership know how badly their brand is being tarnished. Avoid Schneider Electric - which is hard now that they have taken over APC&MGE.

    3. David G from Visalia

      Re: Ever heard of a UPS?

      So, my organization has several UPSs and we had a surprise (hard) power outage about a month ago. There is a guy in the Building Maintenance department whose job is to go to the various UPS installations once per month and simulate a power outage to fire up the UPS. They run the UPS for ten minutes to let it get nice and warmed up before turning it off. On this one UPS he forgot the last step: "flip the switch back to monitor street power".

      By not following a checklist, he caused multiple thousands of dollars in damage, and several hundred people lost half a day's wages, as we had to send them home because their computers were not going to be back up until the following day.

      It's easy to imagine that AWS don't have a UPS and then call them clowns, but it's much more likely that they do have a UPS and suffered some sort of unexpected failure.

      1. Nate Amsden

        Re: Ever heard of a UPS?

        It would probably surprise many, but there are some data centers that don't use any UPSs. That's not to say they don't have power protection. They rely on a newer(??) technology known as flywheels, which provide backup power without batteries. While flywheels look cool on paper and are nice in that they don't need batteries, their critical failing is that they generally have very short runtimes (measured in seconds, maybe 30s tops). That's supposed to be plenty of time for generators to kick in and take the load, if everything goes smoothly.

        But what if things don't go so smoothly and human intervention is required? That's my problem with the runtime of flywheels: they don't give people enough time to react to solve a problem on the spot, or even to get to the location of the problem, before they run out of stored capacity. Maybe the problem is that the automatic transfer switch fails to switch to generator, so the generator is running and ready to take the load, but someone has to go force that switch to the other position to transfer the load. That just scares the hell out of me, and I would not want any of my equipment hosted at a facility that used flywheels instead of UPSs. I was at one facility in probably 2004, a very nice AT&T facility, my first "real" datacenter experience, and I liked it. I was in the reception area when the grid power failed. All lights and computers etc. in the reception area went dark. On-site tech staff came rushing in to get to the data center floor (from their on-site offices they had to go through the reception area to get to the DC) and reassured me the power was fine on the data center floor (it was, no issues there), though they struggled to get to the floor since the security systems were down, I think, but they did get in after maybe 30 seconds. I don't know where they were rushing to exactly, maybe they had to go do something to the generators! I didn't ask but they sure were in panic mode. Power came back a few minutes later.

        I wouldn't even trust redundant flywheels. I want to see at least 5-10mins of runtime available for generators to kick in. Ideally 99% of the time you won't need more than 30 seconds. I'm just paranoid though.

        My first true sysadmin job, in 2000, I built out a 10-rack on-site server room. I equipped it with tons of UPS capacity (no generators). Had two AC units too. I was so proud. I hooked up UPS monitoring and everything: big heavy battery expansion packs on our APC SmartUPS systems, enough for probably 60+ minutes of runtime. Then one day the power kicked off, on a Sunday morning I think it was, and I got the alert on my phone. Yay, the alerts work. Then reality set in about 30 seconds later. Systems are running fine on battery backup, great. But... THERE'S NO COOLING. Oh shit. I drove to the office (5 minutes away) to initiate orderly shutdowns of the systems (doing it remotely at the time was a bit more sketchy). In the end no issues, but my dream of long runtime on UPSs had a fatal flaw there..

        Flywheels were more the rage maybe 15 years ago; for all I know the trend died off (hopefully it did) a long time ago and I just never heard, since it's not my specialty.

  6. Dan 55 Silver badge
    Black Helicopters

    "there may be some hardware that is not recoverable"

    Especially the hardware used to host competitors' products in new markets that Amazon wants to enter.

    bezos white cat.jpg

  7. spireite Silver badge

    Cloud users learning the hard way........

    It's not a surprise.

    I've said it before, I'll say it again.

    Deploying to the cloud does not make you immune from failures.

    You still need to have failover in place.

    Many places I have worked in deploy VMs into Azure or AWS, as singular VMs, not even as clusters.

    Why? .... because there is a belief that it will never fail, because either provider is perceived to have it all in-hand, and of course the bean counter can't 'justify' having replicated 'hardware' if one can do the job.

  8. Anonymous Coward
    Facepalm

    Not surprised.

    I’ve never heard of a situation where moving to the cloud saved money or increased reliability. Not once.

    I can see that the cloud is good for temporary capacity or if you’re starting out or launching something new and want to concentrate on the product whilst leaving the wiring to someone else.

    But otherwise - bring it in-house! Because no-one will care about your infrastructure like you do.

    1. spireite Silver badge

      Re: Not surprised.

      Lots of truth in this.

      My de facto use case for the cloud is the need to spin up instances on demand, and then close them down when not needed.

      This is in the context of CI/CD mostly.

      Outside of that, the figures become somewhat eyewatering if not careful. I regularly see people who should know better tossing in VMs like confetti, with no planning whatsoever.

      Just recently a 64GB 32 vcpu instance was deployed, where we could have easily run what we needed in a 4GB 2 vcpu affair.

      Instantly you multiply the cost many times over FOR NO REASON.
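
      Back-of-envelope, with purely illustrative hourly rates (not real AWS prices) and assuming roughly linear per-vCPU pricing, a quick Python sketch of what that over-provisioning does to the monthly bill:

      # back-of-envelope over-provisioning cost; hourly rates are illustrative
      # placeholders, not real AWS prices, assuming roughly linear per-vCPU pricing
      HOURS_PER_MONTH = 730

      needed_rate = 0.10    # $/hr for a hypothetical 2 vCPU / 4 GB instance
      deployed_rate = 1.60  # $/hr for a hypothetical 32 vCPU / 64 GB instance

      needed = needed_rate * HOURS_PER_MONTH
      deployed = deployed_rate * HOURS_PER_MONTH
      print(f"needed:   ${needed:,.0f}/month")
      print(f"deployed: ${deployed:,.0f}/month ({deployed / needed:.0f}x)")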

      And that is the problem with cloud. It's viewed as an unlimited resource.

      Deploying into a DC on your own kit, you think about what you need, because you can't afford to get it wrong; otherwise you have kit you can't use, or you have overpriced kit.

      For some reason 'overpriced' is never questioned in the cloud context.

      1. Dan 55 Silver badge

        Re: Not surprised.

        The only reason is accountants like Opex because they're unable to think further than three months ahead.

        1. spireite Silver badge

          Re: Not surprised.

          "The only reason is accountants like Opex because they're unable to think"

          Fixed it for you.....

      2. Nate Amsden

        Re: Not surprised.

        Funny story: back in the "earlier" days of cloud at my previous company (before I even started), there were regular occurrences of lack of capacity in Amazon's cloud (this was back in 2009-2010ish). My manager told me there was more than one occasion where they literally had Amazon support on the line as Amazon was installing new servers, and support instructed my manager precisely when to hit the "provision" button in order to secure the resources from the new systems before others got them. I don't know if this was a routine activity or the result of the head of Amazon's cloud being the brother of the then-company's CEO.

        I met with the head of EC2 (now the CEO of Amazon) while at that company (late 2010 I think), along with their "chief scientist" I think they called him, I don't remember his name. They basically just spent the meeting apologizing to us for all the issues we were having and promising to fix them (my manager said that was their regular response and very little ever got fixed). It was a fairly useless meeting. But hey, now I can tell people that story too.

        Then there were also regular reports of "bad" instances out there; people would deploy and redeploy again until they got the kind of hardware they wanted.

  9. heyrick Silver badge

    faced with the need for a sudden rebuild

    Uh, didn't people learn when OVH crashed and burned dramatically? Or are good intentions always ruined by the bean counters?

    1. ayay

      Re: faced with the need for a sudden rebuild

      Well, OVH is cheap and the scrappy underdog, so of course we'll beat them. Losers!

      Now, AWS is too big to fail, so people act like the lap dogs they are, making excuses on their behalf.

      Then we wonder why it always seems like the biggest douchebags keep getting ahead. Maybe it's because we can't possibly forgive the underdogs any mistakes, but are willing to make excuses for the dominant players before they even have to come up with some bullshit.

      See also: AMD between Intel's Core and Ryzen, Microsoft, and many others.

  10. Plest Silver badge
    Happy

    Big whoop!

    Hey, the difference now is that rather than running down to the basement to see what has blown in our own datacentre rooms, we just watch Twitter and wait for news that some techs in a datacentre 600 miles away have run down to their basement!

  11. Clausewitz 4.0
    Devil

    There is no Cloud

    THERE IS NO CLOUD: It’s just someone else’s computer

  12. thejoelr

    Proper failover is great...

    But it still tripped up the paging system for a long time. Some of us are trying to have Christmas. It wasn't only EC2; we had console access issues as well. And then instances coming back, going away, losing their EBS but still being online as EC2... Site up? Sure. Pagers quiet? No.

  13. Anonymous Coward
    Anonymous Coward

    If only AWS followed their own advice on their own tooling

    What was highlighted in the previous outage is that AWS themselves have a single point of failure, on us-west-1, that their own tooling/interfaces are dependent upon.

    AWS themselves were limited in what support they could offer, as they did not have access to the console or any endpoints that their scripting depends upon for the processes they have set up.

    Further, AMS clients (those that have asked AWS to manage their environments and pay them to do so) could not request access to their own EC2 (or anything else for that matter) during this outage.

    i.e. an AMS client with EC2s in multiple availability zones (following AWS guidance) would not have been able to request access to investigate any user query, as the ability to raise an RFC (an AMS requirement) was itself not possible for hours.

  14. fredesmite2
    Flame

    Remember - CLOUD COMPUTING is NOTHING MORE THAN ..

    Remember - CLOUD COMPUTING is NOTHING MORE THAN ..

    using someone else's computer system .. thinking they care about it as much as you do.

  15. sainrishab

    Great! Thanks for sharing.

  16. FrenchFries!

    Hahahaahhaaaaa! Public Cloud is 99.95% retarded.
