Whilst once upon a time I was a direct customer in that facility (and another), I'm now only an indirect customer (by two levels). Our office connectivity monitoring alarm went off at 04:28 this morning and we're still down. LD8 is obviously key in our comms chain somewhere. Poor general communications from Equinix, even if it is a seriously major issue stretching people right now. Most unlike Equinix, based on past experience.
Outage: Faulty UPS at data centre housing London Internet Exchange causes grief for ISPs and telcos alike
One of the UK's larger data centres has suffered a major service outage affecting customers across the hosting, cloud, and telecommunications sectors. The incident was caused by a faulty UPS system followed by a fire alarm (there was no fire) that powered down Equinix's LD8 data centre, a low latency hub that was formerly the …
COMMENTS
-
Tuesday 18th August 2020 14:06 GMT Pascal Monett
"to provide the sense of scale of this outage"
That 150 companies are affected provides absolutely no sense of scale unless you know how many companies there are in total.
I do not. Is it 150 out of 300? That would be important. Is it 150 out of 10,000? That would be relatively insignificant.
So which is it ?
-
Tuesday 18th August 2020 21:34 GMT Jellied Eel
Re: Turtle magic
It's all witchcraft. And marketing. Equinix may have waved their wand over it, but for some, it'll always be HEX. So 150/300 would be wrong anyway. But I digress.
So the important number is usually hidden, and based on the number of decent sized carrier providers based in that building. And then how much space they can get to install their own kit. Then how many times that's been upgraded before realising there's no room to expand their own UPS kit, especially as the network kit has often gotten ever more power hungry. And then because of all that, power to the whole site has become ever more complex to manage.
So then you get cascading failures. If LINX switches lose power, peering across those switches drops. Some traffic may still go via private peering, assuming the kit on both ends of those links have power. Then there may be kit with LEDs still blinking happily, but isolated from the rest of the network because the big carrier they've bought capacity from has lost power to their stonking great DWDM boxen.
But such are the joys of networking. Core stuff went from <2.4Gbps to >5Tbps per rack, so when that rack goes dark, the impact is far greater.
-
Tuesday 18th August 2020 14:10 GMT Flak
That blows the four nines then!
Dual power supply and UPS will only provide so much resilience. Dual (or multiple) bits of equipment in different geographic locations, with diversely routed connections and proper configuration, are essential to achieving high availability.
What needs to be understood is that an actual fault or outage at a single site often can't be fixed within SLA timeframes, particularly if they are measured monthly or quarterly and are at 99.99% or higher.
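To put a rough number on it, here's some back-of-the-envelope arithmetic in Python (plain maths only - real SLAs define their own measurement windows and exclusions):

# Downtime budget for a given availability target, over a month and a quarter.
MINUTES_PER_MONTH = 30 * 24 * 60      # ~43,200
MINUTES_PER_QUARTER = 91 * 24 * 60    # ~131,040

for target in (0.999, 0.9999, 0.99999):
    month = MINUTES_PER_MONTH * (1 - target)
    quarter = MINUTES_PER_QUARTER * (1 - target)
    print(f"{target:.3%} availability: {month:6.1f} min/month, {quarter:7.1f} min/quarter")

Four nines measured monthly works out at roughly four minutes of allowed downtime - one multi-hour site outage burns through the month's budget, and the quarter's, in one go.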
Those of you who are affected by this - I feel your pain!
Use it to invest in proper diversity and resilience if the pain is too great!
Flames - because there were none...
-
-
Tuesday 18th August 2020 19:58 GMT Martin-73
Re: That blows the four nines then!
Yes, I am struggling with this even as a domestic broadband user (albeit a particularly outage-averse one). Several companies offer broadband in our area (I know this isn't the same for all, and I feel privileged) but 90% of them use the same cables, same cabinet and same backhaul.
The other 10% is vermin media.
-
Tuesday 18th August 2020 21:43 GMT Jellied Eel
Re: That blows the four nines then!
Several companies offer broadband in our area (i know this isn't the same for all, and i feel privileged) but 90% of them use the same cables, same cabinet and same backhaul.
Ask for a quote for a 2nd connection with strict route separation. Look at the excess construction charges and wince.
But it's a historical quirk. Back in the good ol' days, exchanges were star networks with lots of spokes (properties) feeding back to a single exchange. Then came LLU and the unruly mob were allowed to put kit in BT's exchanges, but still 1 (ok, 2/4) wire back to a home. And then stuff has been migrating from exchanges to street cabs... And finally some actual competition, i.e. FTTH alt-nets building out their networks. So now it might be possible to have 'diverse' supply for as little as the cost of 2 broadband connections.
Then it's simply (hah!) a case of getting both to work, bearing in mind IP load balances about as well as a sumo wrestler on a pogo stick...
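If you do get two lines in, active/standby with a dumb health check is usually less grief than trying to balance across both. A minimal sketch of the decision bit in Python (the gateway addresses are placeholders, and the actual route flip is left to your router/firewall):

#!/usr/bin/env python3
"""Tiny active/standby WAN health-check sketch.

All names and addresses are hypothetical; the failover action itself
(moving the default route) is left to the router or firewall.
"""
import subprocess
import time

LINKS = [
    ("primary", "192.0.2.1"),      # e.g. the FTTC gateway
    ("backup",  "198.51.100.1"),   # e.g. the alt-net FTTH gateway
]

def gateway_alive(addr: str) -> bool:
    # Three pings, two-second timeout each; alive if the command exits cleanly.
    res = subprocess.run(["ping", "-c", "3", "-W", "2", addr],
                         capture_output=True)
    return res.returncode == 0

while True:
    for name, addr in LINKS:
        if gateway_alive(addr):
            print(f"active link should be: {name} ({addr})")
            break
    else:
        print("both links down - time for the 4G dongle")
    time.sleep(30)

You lose the extra bandwidth, but per-link failover like this avoids the pogo-stick problem entirely.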
-
-
-
-
-
-
Tuesday 18th August 2020 14:53 GMT Rabbit80
Re: Was given the day off..
We're a small business in a business centre... all I knew was that we couldn't access anything in the office - which is a bit of an issue when 50% of our staff work from home. Until I got to the office and confirmed our servers were OK, I had no idea what was wrong - could have been a crash at our end, or worse...
-
Tuesday 18th August 2020 15:01 GMT Anonymous Coward
Re: Was given the day off..
I hear you.
Traceroute can be your friend - fire off a few from random places on the Internet (commercial/consumer VPNs can be useful here) towards your office IP or firewall. If they don't get there, it's not your office. If they reach your firewall, it's not your ISP or line and it's probably your office.
(I'm simplifying, but hopefully you get the idea).
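If you find yourself doing that a lot, it scripts up easily enough. A rough sketch, assuming you have a couple of boxes out on the Internet you can SSH into (the hostnames and office IP below are placeholders):

#!/usr/bin/env python3
"""Sketch: traceroute towards the office from a few outside vantage points.

Assumptions: key-based SSH access to the listed hosts, traceroute installed
on them, and OFFICE_IP / hostnames swapped for your own.
"""
import subprocess

OFFICE_IP = "203.0.113.10"                         # the office firewall's public IP
VANTAGE_POINTS = ["vps1.example.net", "vps2.example.org"]

for host in VANTAGE_POINTS:
    print(f"--- traceroute from {host} ---")
    trace = subprocess.run(
        ["ssh", host, "traceroute", "-n", "-w", "2", "-q", "1", OFFICE_IP],
        capture_output=True, text=True, timeout=120,
    )
    print(trace.stdout)
    if OFFICE_IP in trace.stdout:
        print("reached the firewall: the problem is probably inside the office")
    else:
        print("died en route: the problem is upstream (line, ISP or data centre)")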
-
-
-
-
Monday 24th August 2020 05:44 GMT Stumpy
Re: Was given the day off..
Are you sure about that?
I saw this documentary about it once: https://youtu.be/iDbyYGrswtg
-
-
-
-
-
Tuesday 18th August 2020 17:44 GMT Anonymous Coward
UPS failure often sets off the VESDA system, as the current flowing through the various components is very high. A little wisp of smoke is all it takes. If you remember the Bluesquare issue at Maidenhead a few years ago, one UPS output capacitor blew and took the row of UPS devices out, but the small amount of smoke set off the fire alarm system, which automatically cut the power to the building, and they weren't permitted to reinstate it until they had the OK from the fire brigade. The fire brigade were convinced there was a fire and they had the piece of paper to prove it, so had to have a good look around.
I should say that Bluesquare, and more recently Pulsant, have completely refurbished the UPS systems at Maidenhead.
-
Tuesday 18th August 2020 22:19 GMT SIP My Drink
Working For An MSP....
Is like attacking a bushfire with a chocolate hose...
I was in the thick of this today. We go through numerous ISPs to supply service.
V1 came back up at about 16:30 and services were flowing.
Expo came back much later, as they have a lot more kit across 3 floors in LD8 / HEX. They had to replace 2 switches because they were buggered.
Virgin had 5 routers go pop - they will have to be salvaged, and replaced if no good.
Equinix confirmed at 21:55 that they had resolved the issue. But VM are still hard down, with some clients of theirs being unhappy. Equinix have closed it as they have done their part...
Sterling Job Peeps - Now Off Down The Pub...
-
Wednesday 19th August 2020 10:16 GMT Anonymous Coward
Re: Working For An MSP....
Ouch, very ouch. I ran a bunch of racks in a colo very close by many moons ago now. They had a big power failure and, basically, blew up a load of providers' kit. Luckily for us, our kit was all attached via remote power/reboot switches - a few of those went pop but all our main kit was fine. A rapid dash to site and a re-patch of power cables and we were up and running again within three to four hours, as soon as the power was back, if I recall correctly. Other people there were having a *very* bad day indeed. I felt very relieved and lucky and really sorry for some of them; it was definitely not a moment to be righteous or smug at their expense. The engineers yesterday must really have felt it as well if their kit detonated. :-(
-
-
Tuesday 18th August 2020 23:00 GMT OldCrusty
Good News?
August 18, 2020 at 16:33
Additional to last:
Equinix have advised that electrical work is being carried out at the data centre, whereby services (under-floor sockets) require migration to a new distribution board as one has failed. There are 8 floors that require this work to be carried out. Floors one and two have been completed. As a result we still consider the services to be at risk, despite them being restored currently. Further outages may be seen until the expected completion time of 21:00hrs tonight.
-
Tuesday 18th August 2020 23:23 GMT deevee
Seems UPSes cause more downtime than actual mains power failures do.
Even large global datacentre providers that promote rock-solid power - with multiple UPSes, dual supply feeds, automatic static switching and generator backup - are always suffering power outages in their data halls.
More complexity causes more outages, not less.
-
Wednesday 19th August 2020 02:05 GMT TXITMAN
EPO
EPO systems are the leading cause of these outages. EPOs are required by code in most places and shut down the power to everything. Often the EPO systems are complex, uniquely designed, and not maintained.
FYI: I never saw an Emergency Power Off switch save anyone's life, although it must have happened somewhere.
-
Wednesday 19th August 2020 22:39 GMT swm
Re: EPO
At Dartmouth the computer center had a big red emergency power off button. Once, the computer room filled up with fog as the A/C was misadjusted, and the operator hit the "big red button", probably saving some mainframes and peripherals. I think the button was pressed a second time for another very good reason.
No electronics suffered as a result.
-
-
-
Thursday 20th August 2020 09:53 GMT Martin an gof
Re: Hospitals
I suppose it depends what kind of failure you are trying to protect against. From a personal perspective at both home and work we get far, far more "short" (under a second) mains outages or surges than lengthy power losses. These are blips that would - without a UPS - cause the connected loads to reboot, with "downtime" measured according to how long it takes for the servers to come up again. With a UPS I get a flurry of warning emails*, but everything carries on as normal. A UPS is very beneficial for my use-cases.
As you say, you will get the same thing with the momentary blips caused by failover tests and similar. The difference in a hospital, of course, is that a large amount of critical kit (e.g. bedside kit) has its own internal batteries, unlike a typical data centre, which will have one massive UPS covering multiple devices. This kind of hospital kit tends to be tested very regularly too, and by "distributing" the UPS, a unit failure has very local consequences.
M.
*once I had realised that the people who installed the system at work had put the actual computers on the UPSes, but not the network switches. I mean, what? (Oh, and these were Cisco 2950s, which take about a minute from power-on to come out of the STP "learning" phase and actually pass packets)
-
Thursday 20th August 2020 13:40 GMT Anonymous Coward
Re: Hospitals
Yes, it's understanding what a UPS is for! It's NOT for running your DC off whilst you wait for the power to come back on! It's there to smooth over the few-second outages, and to provide power long enough for stuff to shut down if you get an outage that's going to last anything more than a few tens of minutes (anything longer and your DC is going to get fecking hot, as most times the CRACs aren't fed from the UPS, so you lose cooling anyway). Or to see you through the time it takes for your generator to kick in. The worst outage I had was due to work being done on the supply to the building, which should have caused no more than 20 mins of downtime. I suggested we should have a generator on site but was told it would cost too much and "you have a DC UPS don't you?". Anyway, long story short: total fuckup which resulted in a 1h 30min outage that fecked a lot of our systems. I did get to say I told you so ;o)
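For what it's worth, the "shut things down before the room cooks" part is easy enough to automate. A rough sketch assuming a NUT-managed UPS (i.e. upsc answers for ups@localhost; the threshold and the poweroff command are placeholders - upsmon or apcupsd already do this job properly):

#!/usr/bin/env python3
"""Sketch: shut down cleanly once the UPS has been on battery too long."""
import subprocess
import time

UPS = "ups@localhost"
MAX_MINUTES_ON_BATTERY = 10    # beyond this, treat it as a real outage, not a blip

def on_battery() -> bool:
    out = subprocess.run(["upsc", UPS, "ups.status"],
                         capture_output=True, text=True)
    return "OB" in out.stdout    # NUT reports OL (online) / OB (on battery)

minutes_on_battery = 0
while True:
    if on_battery():
        minutes_on_battery += 1
        if minutes_on_battery >= MAX_MINUTES_ON_BATTERY:
            print("Power isn't coming back soon - shutting down cleanly.")
            subprocess.run(["systemctl", "poweroff"])
            break
    else:
        minutes_on_battery = 0
    time.sleep(60)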
-
-
-
-
Wednesday 19th August 2020 10:12 GMT Anonymous Coward
Hit us
Hit our main Virgin line but not our backup BT line. So the Virgin line was down all day, but we managed to switch over to the backup line after a fight with the firewalls. The Virgin line didn't come back until about 10pm.
I was stuck in the server room most of the day helping. Thankful for new tech: the tablet was playing episodes of Midnight Caller and Columbo in the background, which I could listen to via a Bluetooth headset to drown out the noise of the fans and aircon.
Who says "The Cloud" never fails?
-
Wednesday 19th August 2020 10:50 GMT Anonymous South African Coward
Damned if you do, damned if you don't.
So... whether you host it on-prem or "in the cloud", you're stuck if this happens to a core router and you cannot access your data...
Ah, the joys of IT.
icon --> getting ready to get out of IT, had more than enough stress, BillyWindows brownstuff and just stuffups in general.
-
-
Wednesday 19th August 2020 16:29 GMT Jellied Eel
See? Even the postcode is HEX!
But you could have been living even closer. Due to the march of progress, or at least property developers, at one point it was rumoured that the DC would be closed and redeveloped as luxury apartments. Which would have made life interesting, given the concentration of IT, telecoms and the Internet around Docklands/Isle of Dogs.
-