AWS Frankfurt experiences major breakdown that staff couldn’t fix for hours due to ‘environmental conditions’ on data centre floor

A single availability zone in Amazon Web Services’ EU-Central region (EUC_AZ-1) has experienced a major outage. The internet giant's status page says the breakdown began at 1324 PDT (2030 UTC) on June 10, and initially caused “connectivity issues for some EC2 instances.” Half an hour later AWS reported “increased API error …

  1. David 132 Silver badge
    Happy

    [citation needed]

    From the article: And as humans need oxygen

    See, that’s just shoddy journalism. Wild assertions, not backed up by any references or explanation. Pure speculation.

    Poor show. I shall be canceling my subscription forthwith, etc etc.

    1. swm Silver badge

      Re: [citation needed]

      "From the article: And as humans need oxygen"

      Clearly controlled tests are needed.

      1. KarMann
        Holmes

        Re: [citation needed]

        Controlled tests, and quicklime. Lots of quicklime.

      2. NoneSuch Silver badge
        Coat

        Re: [citation needed]

        I thought Frankfurters were supposed to be hot.

        (ahem...)

    2. Velv
      Joke

      Re: [citation needed]

      Humans don't need oxygen, it's just highly addictive. One breath and you're hooked for life.

      1. Ozumo

        Re: [citation needed]

        And the withdrawal syndrome is invariably fatal.

    3. anothercynic Silver badge
      Alien

      Re: [citation needed]

      I know, many of us sysadmins are not humanoid in nature, but... ;-)

  2. Yet Another Anonymous coward Silver badge

    environment will be safe for re-entry within the next 30 minutes

    Currywurst ?

  3. PeteS46

    And when a fire does break out in the affected area? With an off-lined protection system? What then occurs?

    1. The Mole

      Agree, but I imagine part of the reason it is still offline is it needs resealing and refilled with new gas.

    2. Graham Cobb Silver badge

      No fire suppression system is perfect. It is a business decision whether to resume operation with the system offline. As there is no reason to think a real fire is imminent, and the system has (presumably) not been activated for a long period of time, it seems a sensible decision to resume operation and reenable the suppression system as soon as reasonably possible.

      1. Robert Carnegie Silver badge

        I think that British buildings which were found to share, with the notorious Grenfell Tower, the unfortunate feature of being clad in candle-wax were told to get guards in to patrol through the place watching for fires, till it could be fixed. And as far as I know, they still have the guards. So... if the data centre supports human life now, then they may do that.

  4. Ken Moorhouse Silver badge

    As soon as the fire brigade arrives...

    ...they are in charge, you can kiss any grand plans you have goodbye.

    Demonstrates that resilience is not just about PC-to-PC resilience; it is something much bigger than that.

    1. Anonymous Coward
      Anonymous Coward

      Re: As soon as the fire brigade arrives...

      A client had that with the London bus bombings

      They had a fantastic hot site a few streets away - which the police also cordoned off and wouldn't let them into.

  5. Anonymous Coward
    Anonymous Coward

    "failure of a control system which disabled multiple air handlers in the affected Availability Zone."

    So, redundant control systems are necessary.

    1. Graham Cobb Silver badge

      No - that is what an Availability Zone means: no redundancy promises within the zone. If you need redundancy, pay for a second Availability Zone.

      1. Ken Moorhouse Silver badge

        RE: pay for a second Availability Zone.

        But then, in theory, there could be an internal replication problem between Availability Zones, meaning data is out of chronological sequence.
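The out-of-sequence risk described above is real for any asynchronous cross-zone replication. A common general mitigation (this is an illustrative sketch, not how any specific AWS service works) is to version each record so a replica can reject stale updates that arrive late:

```python
from dataclasses import dataclass

@dataclass
class Record:
    key: str
    value: str
    version: int  # monotonically increasing writer-side version


class ZoneReplica:
    """Toy replica: applies an update only if its version is newer,
    so late-arriving (out-of-order) replication traffic cannot
    overwrite fresher data."""

    def __init__(self):
        self.store = {}

    def apply(self, rec: Record) -> bool:
        current = self.store.get(rec.key)
        if current is None or rec.version > current.version:
            self.store[rec.key] = rec
            return True
        return False  # stale update rejected


replica_b = ZoneReplica()
# Replication delivers version 2 before the delayed version 1:
replica_b.apply(Record("user:42", "new-address", version=2))
accepted = replica_b.apply(Record("user:42", "old-address", version=1))
print(accepted)                           # False: stale write dropped
print(replica_b.store["user:42"].value)   # new-address
```

The key name and values are made up; the point is only that ordering has to be enforced explicitly, because the network will not do it for you.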

  6. Pascal Monett Silver badge

    Once again, Single Point of Failure failed

    It's interesting that, in an industry that can have just about redundant everything (switches, servers, firewalls, you name it), it would appear that nobody bothered to plan redundant aircon (at least, not that I can tell from the article).

    I know aircon is expensive, but now the question is: how much more expensive is a day of 100% downtime?

    You might want to design a second aircon system as backup, just to be sure.

    1. Anonymous Coward
      Anonymous Coward

      Re: Once again, Single Point of Failure failed

      Cooling systems do tend to be resilient - where I work is not a big-budget kind of place and we have A+B cooling in our datacentre. You can still get a common-mode failure even in a resilient system - e.g. if a control is set incorrectly.

      1. Ken Moorhouse Silver badge

        Re: Did you say Commode Failure?

        Is this what the Victorians talked about prior to the invention of the modern-day fan?

        (See Airplane for a visual demonstration).

    2. Charlie Clark Silver badge

      Re: Once again, Single Point of Failure failed

      You might want to design a second aircon system as backup, just to be sure.

      I'm not sure that's directly practicable. What you need is resilient aircon that might have different pumps on different power circuits, but a completely separate airflow is difficult.

      And, at some point, all that extra redundancy means additional complexity, particularly when it's at a single site: load-balancing across separate data centres, or at least buildings on a site, is probably easier.

      I did see an alert from one service I know of (but don't manage) but the ops team said there was no downtime.

      1. Ken Moorhouse Silver badge

        Re: all that extra redundancy means additional complexity

        That goes with the territory. Anyone providing Cloud infrastructure will (should?) be intimately familiar with the risks.

    3. The Basis of everything is...

      Re: Once again, Single Point of Failure failed

      There are a lot of things that could go wrong within an Availability Zone that could cause an application to fail.

      If you want your application to survive anything bad enough to take out a zone, then you need to architect it to use at least two zones, and it's your problem to make sure you've addressed every component and service needed to make it work.

      If that's still not good enough for you, then you need to look at using two separate regions, again making sure to address every component and service needed to make things work.

      It quickly gets complex and expensive to cover off every possible risk. And the more complex it gets, the greater the risk of something going wrong.

      And if there's an event bad enough to take out an entire region - something more than a comms glitch - do you really think your staff are going to be rushing to sort the problem, or frantically combing through the rubble looking for their loved ones?
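The spread-across-zones idea in that comment can be sketched in a few lines. The zone names and instance labels below are hypothetical; this only illustrates why round-robin placement bounds how much one zone failure costs you:

```python
from itertools import cycle

def spread_across_zones(instances, zones):
    """Round-robin instances over zones, so losing any one zone
    removes at most ceil(n / len(zones)) of them."""
    if not zones:
        raise ValueError("need at least one zone")
    placement = {z: [] for z in zones}
    for inst, zone in zip(instances, cycle(zones)):
        placement[zone].append(inst)
    return placement

def survivors(placement, failed_zone):
    """Instances still running after one zone goes dark."""
    return [i for z, insts in placement.items()
            if z != failed_zone for i in insts]

p = spread_across_zones(["web-1", "web-2", "web-3", "web-4"],
                        ["euc1-az1", "euc1-az2"])
print(survivors(p, "euc1-az1"))  # ['web-2', 'web-4']
```

Of course, as the comment notes, placement is the easy part: every dependency (databases, queues, DNS, the deploy pipeline itself) needs the same treatment before the architecture actually survives a zone loss.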

  7. Logiker72

    European Alternatives to Oligopoly

    Hetzner

    OVHCloud

    1&1

    And quite a few more, according to https://www.websiteplanet.com/fr/web-hosting/

    I used Hetzner and OVHCloud. Both worked very nicely and reliably.

    And yes, always have a suitable backup strategy. Data centers do burn down now and then. You need at least three copies of each important record/file, each copy in a different location or, preferably, with a different service provider.
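That three-copies rule is easy to check mechanically. The locations and providers below are made up for illustration; the check itself is just the commenter's policy written down:

```python
def backup_policy_ok(copies, min_copies=3):
    """Check the rule of thumb above: at least three copies,
    no two co-located, and not all with one provider."""
    locations = {c["location"] for c in copies}
    providers = {c["provider"] for c in copies}
    return (len(copies) >= min_copies
            and len(locations) == len(copies)  # every copy in a distinct place
            and len(providers) > 1)            # not all eggs with one provider

copies = [
    {"location": "frankfurt",  "provider": "aws"},
    {"location": "helsinki",   "provider": "hetzner"},
    {"location": "gravelines", "provider": "ovhcloud"},
]
print(backup_policy_ok(copies))  # True
```

A real implementation would also verify that the copies are current and restorable, which is the part most backup strategies skip until it matters.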

    1. mhoneywell

      Re: European Alternatives to Oligopoly

      OVH Cloud - have you missed something?

      https://www.theregister.com/2021/03/10/ovh_strasbourg_fire/

      To be fair, they are now probably a step ahead of AWS in terms of understanding the implications of DR.

  8. UrbaneSpaceman
    Big Brother

    All the smart switches in my house stopped working,

    I had to get up and press a button to switch off the bedroom light - LIKE A SAVAGE!!

    1. AndrewB57

      Back to The Dark Ages for you

      I'll get my coat

      1. TimMaher Silver badge
        Coat

        I’ll get my coat

        Looks like it’s not where you hung it up.

        It’s over there—————>

  9. Bitsminer Silver badge

    cooked datacentre?

    Now that all their equipment has been stress-tested to, what, 45C, 60C while powered on, I wonder how long it will be before high failure rates happen.

    Note that even (or especially) if you cut power at 40C ambient the internal temperatures still rise due to heat flow from the memory sticks and CPU modules.

  10. Anonymous Coward
    Anonymous Coward

    "the building needed to be re-oxygenated"

    Ah, ye'll be wanting to open a window, so ye will.

  11. aldolo

    fart-in-a-jar martin works there

    now his turn is over
