Well...
At least they admit their failures and don't pretend everything is just fine.
Google has revealed the root cause of the outage that disrupted services at its europe-west2-a zone, based in London, during a recent heatwave. "One of the datacenters that hosts zone europe-west2-a could not maintain a safe operating temperature due to a simultaneous failure of multiple, redundant cooling systems combined …
At least it appeared to be a controlled shutdown which, while causing some disruption, should mean an orderly restoration of services is within the contingency plans with, hopefully, nothing lost. In technical terms - a hiccup.
Whereas poor St Thomas' & Guy's also had an unplanned shutdown the same day due to the heat and unspecified issues (cooling or power or both?). It appears that it was less controlled and contingency plans failed completely at the time.
Worse, restoration of services is still, I hear, not complete. Patient welfare was/is seriously compromised. That's probably lethal or permanently damaging to some.
I'm guessing the pressure to run the NHS 'on the cheap' compared to other health systems means that both retaining excellent IT staff to manage their way out of catastrophic situations and maintaining sufficient redundancy have been eroded over time. Similar situation at a university not very far down the road that took months to restore services after a hack.
Hence I have some sympathy for the remaining IT staff, who take the immediate blame, are expected to restore the situation pronto, and then get more blame when they don't...
> retaining excellent IT staff to manage their way out of catastrophic situations
You don't have an MBA, do you: Excellent staff usually commands prohibitive salaries, and handling catastrophes is what we pay insurance for. Service disruption is not an issue, especially if we have a captive clientele, like NHS or universities.
> the hottest day on record in London
Not helped by London being generally warmer than areas outside the city. An effect worsened by all the datacentres' enormous power consumption.
In addition, given the serious lack of electricity infrastructure in the capital, you have to wonder how sensible it is to locate datacentres there. Or to give them planning permission.
It isn't as if they create that many jobs, either.
Since all the reports were referring to Pacific time, my impression was that the reference to "London" meant somewhere in SE England with an 020 dialling code. Still expensive but I'm guessing the extra cost of location was outweighed by the benefits of having it near a large number of users and better communication.
Apparently that moniker comes from "an hour's travel", which makes Stansted worse than Oxford. Even us Oxfordians shake our heads at that ridiculous rebranding exercise (and it's not helped - at all). It continues to be just a distant cousin of Biggin Hill or Farnborough with lots of corporate jet traffic.
The only reason the massive data centre in Olympic Park exists is that it was the press centre for the 2012 Olympics and the plan was to put Amazon in it (and they didn't want it). Whether any of the universities from oop T'North use any of it (since it's just on the other side of the park from them) is another guess.
But yeah... London is getting hotter every year, and yeah, cramming more data centres into Canary Wharf and other parts of East/Central/West London is not helping.
> you have to wonder how sensible it is to locate datacentres [in London].
True, but don't forget how many financial and fintech companies are in London, and the importance of locality and speed for certain missions, e.g. high-frequency trading (for the more general case beyond the nearby/colocated FPGA stuff). Or where a lot of data benefits from being near another load of related data for big data operations where transit latency would compound.
Unless they could all agree where to keep their operations outside London but retain the locality benefits.
The datacentres in question are almost all based around Slough, not closer in.
They were encouraged to set up there BECAUSE of readily available infrastructure, and because siting them under the Heathrow approach/departure path prevents people building houses there and complaining about the noise.
I've questioned the wisdom of such siting ever since I found out about it. Aircraft occasionally fall out of the sky, and taking out data centres might not be "loss of life" but it can be economically devastating.
At least one airport I know of spent billions quietly purchasing all the land under the approach paths to ensure it stayed undeveloped farmland (or warehousing, closer in). Kinda hard with places like Heathrow but something worth considering for ones further out
> all the land under the approach paths
From what little I have observed, you hear the planes just as well on both sides of their flight path. Probably less than if they fly right over your head, but still loud enough to be annoying. It's a difficult problem, since airports need to be as close as possible to big cities, but cities don't want to be anywhere near airports...
Farmland around an airport would seem a good idea, except airports don't want too many birds living nearby, and there is nothing like the free buffet of a field to attract large numbers of birds. (Also the pollution might be a little higher than elsewhere.)
Warehouses and datacenters might be the ideal solution, as long as the constant noise (vibrations) doesn't break things inside.
> It's a difficult problem, since airports need to be as close as possible to big cities, but cities don't want to be anywhere near airports...
Not difficult, really. Put the entry point in the city and have a tram every minute that moves people a few miles to the actual terminal near the runway.
Alternately, I'd certainly enjoy seeing a kilometers-long autowalk moving at 100km/h.
"Aircraft occasionally fall out of the sky and taking out data centres might not be "loss of life" but it can be economically devastating".
I worked at a company (in the travel business) next door to their main Data Centre which was allegedly hardened to cope with an airplane strike.
"At least one airport I know of spent billions quietly purchasing all the land under the approach paths to ensure it stayed undeveloped farmland (or warehousing, closer in). Kinda hard with places like Heathrow but something worth considering for ones further out"
They don't need to do this in the UK for airport operational safety. Airports have a safeguarded area around them. Any building or development applications which would compromise the safeguarding would be referred by the council planning inspector to the airport who can request a safeguarding report from the developer and veto it if it doesn't meet their requirements.
Heathrow's also been at it with the "buying on the sly". They bought quite a bit of the two villages they want to raze for the third runway; of course Greenpeace, HACAN etc all caught a whiff of it and bought land too (and sold off parcels to their supporters - trying to do a Narita in West London/Middlesex).
It makes a lot of sense to buy the land, sit on it, rent it out until the time comes that the development starts and you start booting people out.
Keep in mind it's a San Francisco company. And by San Francisco, I mean somewhere in the SF Bay area, which is a chunk of California that includes places nearly a hundred miles from the city limits of SF.
"London" is pretty much anything south of Scotland as far as most Americans are concerned.
Move it to Manchester. Or somewhere outside London.
Yeah it's easier said than done.
But:
1. Land / property is cheaper than London
2. Big northern cities have the right infrastructure
3. It's generally a few degrees cooler (although not an exact science)
People really need to learn the lesson that London is shit: just because you have expensive property/infrastructure there, it can still suffer failure. Of course that doesn't mean everything in the north just works - but at least you'll have a few extra quid to spend on some fans!
The report doesn't explain why the cooling systems failed. It's because they didn't have redundant cooling systems. That's really all it can be. In a fucking datacentre, in London, owned by one of the richest companies in the world.
> "It's because they didn't have redundant cooling systems. That's really all it can be."
But they (in theory) have redundant datacentres instead - they're needed anyway, so might as well use the ability to fail over between them.
In this case it turned out that the redundancy they thought they had wasn't as good as they thought it was.
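For anyone who wants that failover idea spelled out, here's a minimal sketch in Python of client-side health-check-and-switch logic across two zones. The endpoint URLs and the /healthz path are invented for illustration; in production you'd let a health-checked load balancer do this, but the principle is the same - redundancy only helps if something actually tests the primary and switches.

```python
import urllib.request
import urllib.error

# Hypothetical endpoints for the same service in two zones; the URLs
# and the health-check path are assumptions for illustration only.
ENDPOINTS = [
    "https://service.europe-west2-a.example.com",
    "https://service.europe-west2-b.example.com",
]

def first_healthy_endpoint(endpoints, timeout=2):
    """Return the first endpoint whose /healthz responds OK, else None."""
    for base in endpoints:
        try:
            with urllib.request.urlopen(base + "/healthz", timeout=timeout) as resp:
                if resp.status == 200:
                    return base
        except (urllib.error.URLError, OSError):
            continue  # zone down or unreachable: try the next one
    return None

endpoint = first_healthy_endpoint(ENDPOINTS)
if endpoint is None:
    raise RuntimeError("all zones down - no failover target left")
print("routing traffic to", endpoint)
```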
They mentioned redundant cooling systems
Odds are there was a SPOF in there someone overlooked
I'm currently having to refrain from smashing someone's head into a desk for designing exactly this issue into the redundant systems of my datacentre - DESPITE my having warned them to avoid doing so. It's going to cost ~50k to fix on a 150k cooling installation, but would have added less than 5k to the installed cost.
Said individual got patted on the back for saving money - and the fix-up costs won't fall on him.
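To make the "overlooked SPOF" point concrete: N+1 kit only counts as redundant if the units don't quietly share a component. A toy Python check, with an entirely invented bill of materials, would be something like:

```python
# Hypothetical component lists for two "redundant" chiller circuits.
# Anything that appears in every circuit is a single point of failure.
circuits = {
    "chiller-1": {"pump-1", "controller-A", "shared-header"},
    "chiller-2": {"pump-2", "controller-B", "shared-header"},
}

spofs = set.intersection(*circuits.values())
print("single points of failure:", spofs or "none")
# -> single points of failure: {'shared-header'}
```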
"Said individual got patted on the back...."
With a baseball bat? Or, since this is a UK based story, a cricket bat?...
"Odds are there was a SPOF in there someone overlooked..."
Usually the short-sighted senior manager/director who signed off on the cheaper option because it was, well, cheaper.
"I'm currently having to refrain from smashing someone's head into a desk..."
Doors work just as well and usually easier to get away with as "accidental". FD60/120 doors work really well.
Oh sorry, was that your head on the door? I'm sure the dent will come out.
Of course they are. But we already had datacentres before the much-touted cloud. One just had to organise the multihoming oneself or delegate it to a provider. Big customers already had the requisite elasticity in performance and resources even on-site (I remember daily core-count and memory-size adjustments on a leased on-premises server in 2005).
I'm sure everyone here already understands "the cloud" as "someone else's computer", my rant was meant more towards the manglement, marketing and beancounter side of affairs.
Because in this case everything worked entirely as designed?
If you are deploying your workloads in the cloud in a single availability zone, you have no business being allowed near a console - or at least, you don't get to complain if you have downtime. If you were deploying them across multiple availability zones, then you didn't have a problem.
No, not exactly. There are global services and regional/local services. A single VM is mostly local. It can be migrated under certain circumstances but this is a sensitive process and many users care about where their workloads run. You can't just move a VM from England to Germany without the user knowing. The same is true about disk storage.
Users in regional or zonal services need to make sure that they don't rely on a single zone/region because the building can literally be destroyed by an accident. So either you use higher level services that do this for you, or you use lower level services and you're responsible for ensuring redundancy.
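As a toy illustration of that last point, here's a hedged Python sketch of spreading replicas round-robin across the zones of one region rather than piling them into one. The zone names are the real europe-west2 identifiers, but the placement function is entirely hypothetical - a real deployment would use something like a regional managed instance group instead.

```python
import itertools

# europe-west2 (London) has three zones; a zonal outage like the one
# in the article takes out one of them, not all three.
ZONES = ["europe-west2-a", "europe-west2-b", "europe-west2-c"]

def place_replicas(n_replicas, zones):
    """Assign replicas round-robin so losing any one zone loses at
    most ceil(n_replicas / len(zones)) of them."""
    cycle = itertools.cycle(zones)
    return {f"replica-{i}": next(cycle) for i in range(n_replicas)}

for name, zone in place_replicas(6, ZONES).items():
    print(name, "->", zone)

# With 6 replicas over 3 zones, the europe-west2-a shutdown would have
# taken out 2 of them and left 4 serving: disruption, not downtime.
```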