Google, Oracle cloud servers wilt in UK heatwave, take down websites

Cloud services and servers hosted by Google and Oracle in the UK have dropped offline due to cooling issues as the nation experiences a record-breaking heatwave. When the mercury hit 40.3C (104.5F) in eastern England, the highest ever registered by a country not used to these conditions, datacenters couldn't take the heat. …

  1. TimMaher Silver badge
    Windows

    I’ve had one of those.

    During an exceptionally hot summer in two thousand and “cough”, I was running a network infrastructure for a company down in Kent.

    As nothing was “designed” I had to look at a full height cab containing a stack of edge switches that were massively overheating.

    The answer? Leave the doors open and put a fan on the floor.

    Sad really. But... hey... it worked.

    1. Anonymous Coward
      Anonymous Coward

      Re: I’ve had one of those.

      During the same period, we had a DC in a residential area. Yes, this was an ex-telco building being recycled as a DC, hence it was in a residential area.

      There were no on-site personnel; it was all unmanned.

      We once walked into the facility, only to see all the doors wide open in order to ventilate the kit. Even the local dog could have gone in there and pissed all over the expensive kit!

      Those were the old times ...

  2. steamnut

    Ooops

    Just as the NHS is absolutely overloaded, the dutiful Oracle takes their cloud away....

  3. Nate Amsden

    cooling failure?

    That's why there is a backup cooling system, right? Oh, wait, maybe not there, because Google and Oracle (and other big IaaS players) cut corners on their systems (they do this intentionally to reduce their costs). In this case perhaps Google and Oracle share a common co-location or something. (Obviously not all co-locations are created equal; most are pretty crappy, but the biggest players generally have good setups, at least in situations where they built themselves rather than acquired some smaller player's assets.)

    Just keep that in mind for folks that think these cloud providers choose top tier setups for their data centers (some techies (surprisingly few IMO) will already know this is not the case, and if you use these providers you should plan for facility failure).

    This situation reminds me of one of my earliest jobs 22 years ago. I had a 10 rack server room with a 2 ton HVAC and also a 4 or 6 ton HVAC. I lived about 2-3 miles from the office. I had what I thought was a great power setup: lots of big UPSs, lots of runtime (60+ minutes with battery expansion packs). I went through great pains to set up a combination of APC PowerChute for Unix/Windows as well as Network UPS Tools (NUT). I was so proud. It ran great.

    One Sunday morning (still in bed) I get a text on my phone saying UPSs are switching to battery. I was happy everything was working the way I intended. Until about 30 seconds later, when I realized the cooling system had no power. So I rushed to the office to perform manual shutdowns on stuff. Nothing lost, nothing damaged. But I learned a good lesson that day. That was my only position where I had an on-site server room I was responsible for; everything since has been co-location.

    1. Anonymous Coward
      Anonymous Coward

      Re: cooling failure?

      To be fair, I remember seeing some AWS documentation that said you should replicate your services in at least two zones (DCs) within a region (preferably more regions as well). They expect that individual zones will completely fail on occasion. Zones are supposed to be set up to be completely separate. So, every zone should have separate resources (power, internet, etc.) and be placed far enough apart that a hurricane can't come through and wipe out two zones. As long as each zone is completely independent of the others and your service fully resides in at least two zones, then your service should continue running just fine.

      But I don't think the documentation discussed preparing for heat waves large enough to cover the entire region. Deploying to multiple regions is usually to address latency or data sovereignty concerns.
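The reasoning above (independent zones protect you, a region-wide heat wave may not) can be put into rough numbers. What follows is only a back-of-envelope sketch: the 99% per-zone availability and the 0.1% chance of a region-wide event are made-up illustrative figures, not anything published by AWS or any other provider, and the function names are hypothetical.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one zone is up, ASSUMING zones fail independently."""
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # All zones must be down at once for the service to be down.
    return 1.0 - (1.0 - zone_availability) ** zones


def availability_with_shared_risk(
    zone_availability: float, zones: int, shared_failure_prob: float
) -> float:
    """Same as above, but a region-wide event (assumed here to take out every
    zone at once) occurs with probability shared_failure_prob."""
    if not 0.0 <= shared_failure_prob <= 1.0:
        raise ValueError("shared_failure_prob must be between 0 and 1")
    return (1.0 - shared_failure_prob) * combined_availability(zone_availability, zones)


if __name__ == "__main__":
    # Illustrative numbers only.
    for n in (1, 2, 3):
        independent = combined_availability(0.99, n)
        shared = availability_with_shared_risk(0.99, n, 0.001)
        print(f"{n} zone(s): independent={independent:.6f} with shared risk={shared:.6f}")
```

Under these assumptions, adding zones helps quickly while failures are independent, but the shared term puts a ceiling on what extra zones in the same region can buy you.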

      1. Strahd Ivarius Silver badge

        Re: cooling failure?

        but since anyway everything (and especially DNS) revolves around a bunch of servers located in Langley, Virginia, when these go down everything goes down...

        1. Mobster

          Re: cooling failure?

          Actually DNS is not hosted in a single zone or even a single region by Amazon.

          1. cawfee

            Re: cooling failure?

            Route 53 (the world's greatest database imo) is a *global* service

      2. Nate Amsden

        Re: cooling failure?

        Yes, in cloud terms using a different zone is supposed to mean a different data center even if it is nearby (which goes to my comment about planning for facility failure). Though history has shown in several cases that cloud issues can (certainly not always) impact multiple zones in a region.

    2. hoola Silver badge

      Re: cooling failure?

      I think that there is a possible misunderstanding: the issue is not so much that there are cooling failures (and there have been) but, more critically, that the cooling has been unable to cope with the extreme heat. Bringing in air that is already over 30 degrees to then use it to chill water, or whatever the solution is, has a massive hit. Just switching to a +1 cooling solution does not help. There is unlikely to be the capability of running primary and backup together, as the supply will not cope at the load needed.

      As for what others have said about just opening the rack doors: this does not work in most modern data centres, as all you do is increase the ambient temperature in the cold aisle. It may help in a very localised situation, but at the scale cloud providers are running, it will not be possible.

      And as the article states, the goal is to avoid frying equipment or compromising its lifespan by running at or over designed temperature limits. There simply isn't the option to replace fried servers and storage at the moment. There are already capacity issues, so what other options do they have?

      I suppose one could argue that more cooling capacity should have been specified at the site, but at the moment these events are not regular enough to justify the increased costs.

      Costs that the consumer of the services will have to cover.......

      There will be companies all over that are having the same issues; it is just not making headlines.

      We had to reduce the available HPC capacity by 50% to provide some service rather than cook the cluster. At least it is still working.

      1. Nate Amsden

        Re: cooling failure?

        Having extra cooling should certainly help. It's not as if there were widespread reported data center outages across the UK or even Europe. Even if they were not reported, such outages would show up in services going down across the board, and that was not the case at all.

        So this seems to be a fairly isolated incident, likely a single facility that has both Google and Oracle in a co-location.

    3. Anonymous Coward
      Anonymous Coward

      Re: cooling failure?

      There may well be a backup cooling system, but if the hardware for the primary can't handle the heat why would the backup suddenly be able to handle it? These are unprecedented temperatures for the UK and neither system was probably specced to such high levels.

      The Cloud providers are also very clear that you should be building fault-tolerant systems, as no infrastructure is immune to failure.

      On AWS and GCP for example, a region is composed of multiple zones, and you should design applications to tolerate a failure of an entire zone. If you're running a single VM in a single zone, then you're going to get bitten; if your app is really that important, multi-zone is a must. They make this easy to do, e.g. with auto-scaling groups.

      The issues here only affected a single zone, so if that VM was either clustered across zones or set to automatically reprovision in another zone in the event of failure, you would've been unaffected.

      1. Nate Amsden

        Re: cooling failure?

        If you have N+1 cooling, and you're driving your cooling pretty hard, bringing up the 3rd cooling unit (the redundant one) should dramatically reduce the load on the other two units (for example). Less chance of failure. But if they can't handle a cooling failure then that says bad facility design to me regardless of outside temps.

        Certainly such a design can be intentional and accepted by the customer as a risk for a lower cost setup. The problem is most lower tier customers have no idea such compromises are made and are caught off guard by the outages (so in the end it's a customer education issue, but one that the "cloud" players do a lot to downplay in their operations).
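The N+1 argument and the earlier "hot intake air hurts every unit" argument can both be illustrated with some simple arithmetic. This is a toy sketch, not a model of any real plant: the 180 kW heat load, the 100 kW units, the 35C design temperature and the linear 4% per degree capacity loss are all invented for illustration, and real chillers do not derate in such a tidy way.

```python
def derated_capacity(
    rated_kw: float, outside_c: float, design_c: float, loss_per_degree: float
) -> float:
    """Hypothetical linear derating: a unit loses a fixed fraction of its rated
    capacity for every degree the outside air is above the design temperature."""
    excess = max(0.0, outside_c - design_c)
    return max(0.0, rated_kw * (1.0 - loss_per_degree * excess))


def per_unit_load(heat_load_kw: float, unit_capacity_kw: float, running_units: int) -> float:
    """Fraction of its available capacity each unit carries, assuming an even split.
    A value above 1.0 means the units cannot remove the heat."""
    if running_units < 1:
        raise ValueError("running_units must be at least 1")
    if unit_capacity_kw <= 0:
        raise ValueError("unit_capacity_kw must be positive")
    return heat_load_kw / (unit_capacity_kw * running_units)


if __name__ == "__main__":
    heat_load_kw = 180.0  # illustrative IT heat load
    for outside_c in (30.0, 40.0):
        capacity = derated_capacity(100.0, outside_c, design_c=35.0, loss_per_degree=0.04)
        for units in (2, 3):
            load = per_unit_load(heat_load_kw, capacity, units)
            print(f"{outside_c:.0f}C outside, {units} units: {load:.0%} of available capacity each")
```

With these invented figures, two units run at 90% on a normal day and the third brings that down to 60%; at 40C the two units alone are past 100%, while all three together sit at 75%. Whether the third unit can actually be run alongside the others (power supply, shared heat rejection) is exactly the point the two posters disagree on, and the sketch does not settle it.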

  4. Youngone Silver badge

    The lost art of conversation

    I'm prepared to bet a whole pound that I know what people are talking about in Britain right now.

    1. Dinanziame Silver badge
      Joke

      Re: The lost art of conversation

      Low bar — British people always talk about the weather

      1. Diogenes
        Joke

        Re: The lost art of conversation

        The problem is that while everybody talks about the weather, nobody actually does anything about it.

        1. Potemkine! Silver badge
          Trollface

          Re: The lost art of conversation

          Eradicating China, the US and India isn't that easy you know.

      2. This post has been deleted by its author

      3. Youngone Silver badge

        Re: The lost art of conversation

        Well, yes. A whole pound though.

    2. gandalfcn Silver badge

      Re: The lost art of conversation

      The difficulty of having a J Arthur?

      1. Peter Gathercole Silver badge

        Re: The lost art of conversation

        No need to be crude!

    3. 3arn0wl

      The Great British weather

      Another - somewhat unexpected - reason for not doing computing in someone else's cloud. :/

      1. Brewster's Angle Grinder Silver badge

        Re: The Great British Server Bake Off

        You mean it takes away the fun of running around* trying to stop on prem servers from overheating?

        * Running around creates a large, convection current which may stop your servers dying and/or prevent you having to shut down vital money making services.

  5. Still Water

    Slough?

    Assuming this might be located in Slough, given Oracle, and given the same heads up we got at work today about a similar cooling failure at the same time...

    1. Anonymous Coward
      Anonymous Coward

      Re: Slough?

      Buckingham Avenue by any chance?

  6. Gerard Krupa

    Amazon's servers in the UK are so powerful that they managed to fall over due to a thermal event on the 10th, a full week before the heatwave.

    1. Anonymous Coward
      Anonymous Coward

      That is the power of predictive maintenance

  7. Anonymous Coward
    Anonymous Coward

    The times they are a changin’

    Move the server farms to the Shetlands.

    1. Anonymous Coward
      Anonymous Coward

      Re: The times they are a changin’

      Move the server farms to the Shetlands

      I second that. Just been there, lovely cool place 12deg.c all year round.

      1. Paul Crawford Silver badge

        Re: The times they are a changin’

        Just rather brutal weather in the winter.

        1. Fred Flintstone Gold badge

          Re: The times they are a changin’

          Even better from a cooling perspective ;)

      2. Michael Wojcik Silver badge

        Re: The times they are a changin’

        And there's that nice Jimmy Perez fellow who comes around if anyone's murdered, according to a documentary we see here in the US.

    2. Mr Larrington
      Flame

      Re: The times they are a changin’

      This Unit went there in 2019 and got sunburned...

    3. Anonymous Coward
      Anonymous Coward

      Re: The times they are a changin’

      > relocate to $coldplace

      No. Locate to high population centres.

      Capture/use the excess heat, eg in some municipal system

      Ditto renewables. Capture excess leccy into energy storage (eg Finland using 'sand batteries')

  8. Pascal Monett Silver badge

    "As a result of unseasonal temperatures in the region"

    Might be looking to start considering those temperatures as seasonal.

    So start enhancing your cooling operations.

    1. hoola Silver badge

      Re: "As a result of unseasonal temperatures in the region"

      And the increased costs of subscriptions..........

      If people want reliability money is key. Cloud providers work by trimming to the lowest spec to keep costs down so that they are cheaper (ha ha.....) than on prem!

      1. Peter Gathercole Silver badge

        Re: "As a result of unseasonal temperatures in the region"

        It's actually worse than just them trimming to the lowest spec, their pricing is set so it is only just on the surface cheaper than on-prem. provision (and even then, I think that a number of organizations are eyeing their cloud bills with some consternation, trying to identify the savings they were promised).

        Once you factor in designing the service to be multi-zone, the affordability equation changes quite markedly.

        I know. You should be making services site failure tolerant anyway, but if you control the complete infrastructure from building, plant and infrastructure, you have a better way of ensuring that you adequately spec. the installation.

        What was the Azure story a few weeks back? MS had a problem in one region where they could keep existing load running, but could not spin up new images that were not already executing? What would happen if you were multi-zone, or even multi-cloud, and your fallback resided in that affected zone? Think you could spin up your backup zone with no resource?

        Service designers have been lured into thinking that there is always spare capacity in the cloud. Recent events seem to suggest that this is not the case, and the cloud is actually finite! Whodathunkit!

        The cloud is just another person's computer (and one where you have no say in how it's installed).

      2. Anonymous Coward
        Anonymous Coward

        Re: "As a result of unseasonal temperatures in the region"

        > Cloud providers work by trimming to the lowest spec to keep costs down so that they are cheaper (ha ha.....) that on prem!

        Unless your on-prem is gold plated, Oracle OCI most definitely isn't cheaper ;-)

    2. adam 40 Silver badge

      Re: "As a result of unseasonal temperatures in the region"

      unseasonal temperatures?

      unseasonal would be snow! This is definitely seasonal! As evidenced by BBQ's in the seasonal aisle in Tesco's.

      1. Michael Wojcik Silver badge

        Re: "As a result of unseasonal temperatures in the region"

        Agreed. These are unusual, certainly, but this is the season for 'em.

        I also thought the claim "the highest ever registered by a country not used to these conditions" is a bit odd. I see what Katyanna means, but taken literally it's basically "these are unusual in places where they are not usual". Or "it's never been this hot in places where it's never been this hot".

        This is in no way meant to diminish the severity of the heat wave and its effects in the UK. I recall days as a child in Massachusetts where the air temperature got into the high-30s (°C), and we didn't have air conditioning (which I've never liked anyway), but the Boston area has only approached 40°C a couple of times since record-keeping began. We had hotter days when I lived in Nebraska, but there we had A/C. Heat that high is definitely dangerous for people and equipment.

  9. Anonymous Coward
    Anonymous Coward

    Google distributed datacentres

    I thought that Google Search replicated data across multiple locations, but it looks like their cloud servers are a single point of failure. Last night a friend was grumbling that they couldn't get into either ocado or waitrose, maybe that was the same issue.

    1. Ken Moorhouse Silver badge

      Re: Last night a friend was grumbling that they couldn't get into either ocado or waitrose

      Sainsburys was open, but had a bit of a struggle carrying the water home.

      1. John Robson Silver badge

        Re: Last night a friend was grumbling that they couldn't get into either ocado or waitrose

        Stop using a sieve to carry it then.

        Taps work quite well, not aware of anywhere using standpipes at the moment.

        1. Ken Moorhouse Silver badge

          Re: Stop using a sieve to carry it then.

          No chance: I've been warned on here often enough about leaky buckets

        2. Anonymous Coward
          Anonymous Coward

          Re: Last night a friend was grumbling that they couldn't get into either ocado or waitrose

          Shh - don't tell everyone that UK*** tap water is good enough to drink. You'll dent the profits of the firms that bottle the stuff from the same sources. IIRC Coca-Cola had to suspend/close one of its UK bottled water plants due to product contamination.

          ***for other countries YMMV. It may also change in the UK as the current government appears intent on relaxing regulations.

          1. David Hicklin Bronze badge

            Re: Last night a friend was grumbling that they couldn't get into either ocado or waitrose

            > Coca-Cola had to suspend/close one of its UK bottled water plants

            They had to close it as they were caught out reselling bottled Thames Water tapwater

          2. John Robson Silver badge

            Re: Last night a friend was grumbling that they couldn't get into either ocado or waitrose

            Don't worry, the regulations preventing people from selling bottled water with a floating turd will also be gone in the name of profit.

  10. Anonymous Coward
    Anonymous Coward

    Siteground hosting

    I use Siteground, they use Google cloud.

    They advised me "We are happy to inform you that everything appears to be fully operational and all services are up and running.". No they're not, now offline for 16 hours.

  11. Mike Parris

    Fried Hardware

    It's looking bad.

    5 of my 6 sites running on Siteground have been down since yesterday and they cannot give me any ETA on a fix

    Siteground are using backups to transfer sites to their Amsterdam server farm.

  12. Anonymous Coward
    Anonymous Coward

    Back in the 1970s - a newly built computer room in Africa was having over-heating problems. The design checked out ok for needing only two AC units - but another unit was eventually installed to solve the problem.

    The AC units were very tall. They took in air at the top at ceiling level and then blasted refrigerated air under the false floor - which had vent tiles distributed round the room.

    On our visit to upgrade some software the mainframe operators were very hospitable. They offered us a cold beer. Then they lifted the floor tiles at the base of one of the original AC units - to reveal their stash of beer and watermelons ranged against the air outlet.

    1. Anonymous Coward
      Anonymous Coward

      On the plus side, they did have at least their BOFH priorities right ;)

  13. sketharaman

    AC/DC

    I get why 40C can be a problem in homes and even offices that don't have fans or AC. But Data Centers are supposed to have AC. If ACs in DCs in Middle East and India can support temperatures as high as 50-55C, why don't the ones in Oracle and Google DCs in UK support 40C?

    1. Anonymous Coward
      Anonymous Coward

      Re: AC/DC

      Structural things are designed to meet the known constraints of their expected environment. Over-engineering - apart from a safety margin - is deemed uneconomic. The UK having 40C is not only a record but was a jump of several degrees over the last record high. Usually any new record high has been measured as only a small fraction of a degree.

      It's like flood provisions. The designs usually allow for 50 or 100 year exceptional conditions which they may or may not try to handle. Not only are UK summers getting hotter - but the same climate change factors are increasing the amount of rain that will be dumped in a short period. Hence recent flooding becoming more severe and happening more often.

    2. werdsmith Silver badge

      Re: AC/DC

      We spec stuff to cope with the expected requirement trying not to spend money on stuff that is not needed.

      In my lifetime this country only became a 38/40 C (100+F in old-fashioned units) place in the last two or three years. An occasional brief peak to 35C was more normal. Yesterday we were at 39 / 40 for a few hours. The hair dryer breeze was quite the experience. I just went to sleep.

      Anyway, after the data centre is designed with sufficient cooling and UPS capacity, 5 years later more kit has been added without consideration for the supporting infrastructure. How many times have I seen this? Lost count. The solution is often to choose some non-essential stuff and switch it off, and keep going like that until the only thing left on is the KVM switch.

      My local newspaper website is offline, probably because of heat.

  14. Surrey Veteran

    Not hot swap ...

    We used to joke to interns that their Oracle queries were slow due to hot weather :)

    Jokes apart, and leaving aside the excellent discussion about DC design above, I wonder if it is not also an issue of the application's architecture?

    Supposedly most cloud providers provide things such as geo-replication, and any good DevOps team would even be able to automate the fail-over procedure to an alternative region if you lost a UK DC. Of course, in most cases it is not automatic and is something that you need to factor into your architecture and DR.

    1. Anonymous Coward
      Anonymous Coward

      Re: Not hot swap ...

      Alternative regions would not be an option for customers who have legal restrictions to keep their data within a particular jurisdiction.
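The two points above (automate fail-over to another region, but only where the law lets the data go) combine into a simple selection rule. The sketch below is purely illustrative: the `Region` type, the region names and the jurisdiction labels are hypothetical, and it does not use any real cloud provider's API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Region:
    name: str          # hypothetical region name, not a real provider identifier
    jurisdiction: str  # where data stored in this region legally resides
    healthy: bool      # result of whatever health check you already run


def pick_failover(
    regions: Iterable[Region], primary: str, allowed_jurisdictions: Set[str]
) -> Optional[Region]:
    """Return the first healthy region other than the primary whose jurisdiction
    is permitted, or None if no such region exists."""
    for region in regions:
        if region.name == primary:
            continue
        if region.healthy and region.jurisdiction in allowed_jurisdictions:
            return region
    return None


if __name__ == "__main__":
    regions = [
        Region("uk-a", "UK", healthy=False),  # the overheated site
        Region("nl-a", "NL", healthy=True),
        Region("uk-b", "UK", healthy=True),
    ]
    print(pick_failover(regions, "uk-a", {"UK"}))
    print(pick_failover(regions, "uk-a", {"UK", "NL"}))
```

A customer restricted to UK jurisdiction only has a fail-over target if a second healthy UK location exists; if it does not, the function returns `None` and the outage has to be ridden out, which is the constraint the reply describes.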

  15. ThatOne Silver badge
    Devil

    My airport is melting

    > Luton Airport being temporarily closed as well due to a melting runway

    Wow. The UK is really not ready for Weather 2.0.

    1. Anonymous Coward
      Anonymous Coward

      Re: My airport is melting

      Roads, runways, railway tracks, buildings, bridges - all have been specified to meet likely environmental conditions in the UK. Apparently it is quite tricky to make a road surface that can handle extremely high temperatures - but that will not fail in winter.

      1. Anonymous Coward
        Anonymous Coward

        Re: My airport is melting

        Really? I drive on such every day.

        I personally am appalled at thinking the UK will have to import that expertise from Texas and elsewhere in the American south. Though, we'll all soon be consulting Arabia for fashion wear.

        1. ThatOne Silver badge

          Re: My airport is melting

          Was going to say the same. There are lots of places on this Earth which get quite toasty in summer and rather frosty in winter, and most of them seem to have solved that problem somehow.

          Road surfaces melting in hot weather was common in the early 20th century when they still used raw tar to cover streets.

        2. Fred Daggy Silver badge
          Coat

          Re: My airport is melting

          Don't worry, the Minister for Administrative Affairs has a meeting at the Qumrani embassy soon, he'll be able to advise after that.

      2. werdsmith Silver badge

        Re: My airport is melting

        Apparently it is quite tricky to make a road surface that can handle extremely high temperatures - but that will not fail in winter.

        You forgot to add “on the cheap”. Of course the road engineers can do this, but the politicians won’t want to pay for it.

        The M3 down towards Winchester Southampton has a section that seems to be concrete and heat resistant, similar to some of the roads in USA where it gets hot. It’s not good for tyre noise though.

      3. Anonymous Coward
        Anonymous Coward

        Re: My airport is melting

        Judging by the roads in Malta a few years ago you don't really need to cool things down to winter temperature to have that problem, the heat itself was already enough. The only reason cars didn't disappear in some of the holes was because there was already a car in them.

        That said, if you then add the expansion of water being frozen to it (assuming we will still have winters) you're dealing with a whole new ballgame. I have nothing but respect for the people who solve these problems.

      4. David Hicklin Bronze badge

        Re: My airport is melting

        > Apparently it is quite tricky to make a road surface that can handle extremely high temperatures

        Apparently this only affects the older road surfaces, more recent ones are supposed to be specced for higher temperatures.

  16. FlamingDeath Silver badge

    Duh error

    Internet, was supposed to be resilient

    Replication, redundancy, routing, etc

    God help us if we did actually experience a nuclear attack.

    What a joke this species is, am I being too harsh??

    “I Like Money” - Frito

  17. Pembroke

    More heat to come?

    Well, as global warming is here to stay, I hope all these datacentres will be upgraded for next year's even bigger heatwave?

  18. Beeblebrox

    Grauniad reads el reg

    https://www.theguardian.com/commentisfree/2022/jul/23/if-our-datacentres-cannot-take-the-heat-the-uk-could-really-go-off-the-rails

    Could become a feedback loop.
