Outage: Faulty UPS at data centre housing London Internet Exchange causes grief for ISPs and telcos alike

One of the UK's larger data centres has suffered a major service outage affecting customers across the hosting, cloud, and telecommunications sectors. The incident was caused by a faulty UPS system followed by a fire alarm (there was no fire) that powered down Equinix's LD8 data centre, a low latency hub that was formerly the …

  1. Anonymous Coward

    Whilst once upon a time I was a direct customer in that facility (and another), I'm now only an indirect (by two levels) customer. Our office connectivity monitoring alarm went off at 04:28 this morning and we're still down. LD8 is obviously key in our comms chain somewhere. Poor general communications from Equinix, even if it is a seriously major issue stretching people right now. Most unlike Equinix, based on past experience.

  2. Anonymous Coward

    Our connection suddenly sprang back to life about 10 minutes ago, having been off all day.

    Communication has been very poor.

  3. karlkarl Silver badge

    Have they tried turning it off and on again?

    Preferably leaving it off.

    ... the fire, that is.

  4. Pascal Monett Silver badge
    FAIL

    "to provide the sense of scale of this outage"

    That 150 companies are affected provides absolutely no sense of scale unless you know how many companies there are in total.

    I do not. Is it 150 out of 300? That would be important. Is it 150 out of 10,000? That would be relatively insignificant.

    So which is it?

    1. NE-bot

      Re: "to provide the sense of scale of this outage"

      I think it's more that quite a few of these companies will be internet providers, such as exponential-e

    2. Anonymous Coward

      Re: "to provide the sense of scale of this outage"

      It's more a case of there being 150 companies who directly use that facility, all of whom were impacted.

      We use different Equinix facilities, but some of our suppliers use LD8 and were down all day.

    3. Jellied Eel Silver badge

      Re: Turtle magic

      It's all witchcraft. And marketing. Equinix may have waved their wand over it, but for some, it'll always be HEX. So 150/300 would be wrong anyway. But I digress.

      So the important number is usually hidden, and based on the number of decent sized carrier providers based in that building. And then how much space they can get to install their own kit. Then how many times that's been upgraded before realising there's no room to expand their own UPS kit, especially as the network kit has often gotten ever more power hungry. And then because of all that, power to the whole site has become ever more complex to manage.

      So then you get cascading failures. If LINX switches lose power, peering across those switches drops. Some traffic may still go via private peering, assuming the kit on both ends of those links has power. Then there may be kit with LEDs still blinking happily, but isolated from the rest of the network because the big carrier they've bought capacity from has lost power to their stonking great DWDM boxen.

      But such are the joys of networking. Core stuff went from <2.4Gbps to >5Tbps per rack, so when that rack goes dark, the impact is far greater.

      1. laughthisoff

        Re: Turtle magic

        To some of us, 'HEX' will always mean Redbus, not Telecity ;-) #borgedbyequinix

  5. Flak
    Flame

    That blows the four nines then!

    Dual power supplies and UPSes will only provide so much resilience. Dual (or multiple) bits of equipment in different geographic locations, with diversely routed connections and properly configured, are essential to achieving high availability.

    What needs to be understood is that an actual fault or outage at a single site often can't be fixed within the SLA timeframes, particularly if they are measured monthly or quarterly and are at 99.99% or higher.
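
    To put rough numbers on that (standard availability arithmetic, nothing specific to Equinix's actual SLA terms), here is a quick Python sketch of the downtime budget each level of 'nines' leaves you:

    ```python
    # Back-of-the-envelope downtime budgets for common availability targets.
    # Purely illustrative arithmetic - real SLA terms vary by contract.

    MINUTES_PER_DAY = 24 * 60

    periods = {
        "month (30 days)": 30 * MINUTES_PER_DAY,
        "quarter (91 days)": 91 * MINUTES_PER_DAY,
        "year (365 days)": 365 * MINUTES_PER_DAY,
    }

    for target in (99.9, 99.99, 99.999):
        for name, minutes in periods.items():
            allowed = minutes * (1 - target / 100)
            print(f"{target}% over a {name}: ~{allowed:.1f} minutes of downtime allowed")
    ```

    Four nines measured monthly works out to roughly 4.3 minutes of allowed downtime, so an all-day site outage like this one blows through that budget by more than two orders of magnitude.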

    Those of you who are affected by this - I feel your pain!

    Use it to invest in proper diversity and resilience if the pain is too great!

    Flames - because there were none...

    1. tip pc Silver badge

      Re: That blows the four nines then!

      Yep

      Some people think paying for “resilient” “diverse” links to their cab is resiliency.

      Resiliency is doing that to symmetrically configured kit in 2 geographically different areas.

      1. Martin-73 Silver badge

        Re: That blows the four nines then!

        Yes, I am struggling with this even as a domestic broadband user (albeit a particularly outage-averse one). Several companies offer broadband in our area (I know this isn't the same for all, and I feel privileged) but 90% of them use the same cables, same cabinet and same backhaul.

        The other 10% is vermin media.

        1. Jellied Eel Silver badge

          Re: That blows the four nines then!

          Several companies offer broadband in our area (I know this isn't the same for all, and I feel privileged) but 90% of them use the same cables, same cabinet and same backhaul.

          Ask for a quote for a 2nd connection with strict route separation. Look at the excess construction charges and wince.

          But it's a historical quirk. Back in the good ol' days, exchanges were star networks with lots of spokes (properties) feeding back to a single exchange. Then came LLU and the unruly mob were allowed to put kit in BT's exchanges, but still 1 (ok, 2/4) wire back to a home. And then stuff has been migrating from exchanges to street cabs. And finally some actual competition, i.e. FTTH alt-nets building out their networks. So now it might be possible to have 'diverse' supply for as little as the cost of 2 broadband connections.

          Then it's simply (hah!) a case of getting both to work, bearing in mind IP load balances about as well as a sumo wrestler on a pogo stick..

  6. NE-bot

    Tentative fix estimate

    We're being affected. Virtual1, who are also affected, have a very tentative fix estimate of 16:30.

  7. Rabbit80

    Was given the day off..

    I got the alarm at 6am to say our network was down. Got to the office early, discovered that we had no internet and no estimated time for it to be resolved. Decided I'm not sitting and twiddling my thumbs all day, so left by 9.30am :)

    1. Anonymous Coward

      Re: Was given the day off..

      Didn't you check first from home to avoid the journey in? ;-)

      1. Rabbit80

        Re: Was given the day off..

        We're a small business in a business centre.. all I knew was that we couldn't access anything in the office - which is a bit of an issue when 50% of our staff work from home. Until I got to the office and confirmed our servers were ok I had no idea what was wrong - could have been a crash at our end, or worse..

        1. Anonymous Coward

          Re: Was given the day off..

          I hear you.

          Traceroute can be your friend - fire off a few from random places on the Internet (commercial/consumer VPNs can be useful here) towards your office IP or firewall. If they don't get there, it's not your office. If they reach your firewall, it's not your ISP or line and it's probably your office.

          (I'm simplifying, but hopefully you get the idea).
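
          If you want to script that first sanity check, here's a minimal sketch of the same idea - the address and port below are made-up placeholders, so substitute your own firewall's public IP, and run it from somewhere outside the office (a VPS, a phone on 4G, a consumer VPN exit, etc.):

          ```python
          #!/usr/bin/env python3
          """Crude 'is it us or is it the line?' probe."""
          import socket

          OFFICE_IP = "203.0.113.10"   # hypothetical placeholder address (TEST-NET-3 range)
          PROBE_PORT = 443             # e.g. the firewall's HTTPS or VPN port

          def reachable(host: str, port: int, timeout: float = 5.0) -> bool:
              """Return True if a TCP connection to host:port succeeds."""
              try:
                  with socket.create_connection((host, port), timeout=timeout):
                      return True
              except OSError:
                  return False

          if __name__ == "__main__":
              if reachable(OFFICE_IP, PROBE_PORT):
                  print("Firewall answers from out here: the line is up, so suspect the office end.")
              else:
                  print("No answer from out here: more likely the ISP/line/upstream than your office kit.")
          ```

          (A real traceroute also shows roughly where the packets stop, which this quick TCP check can't; and a refused connection counts as 'no answer' here, so treat it as a first pass only.)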

      2. Stumpy
        Happy

        Re: Was given the day off..

        He would have done, but his Internet was down

        1. Anonymous Coward

          Re: Was given the day off..

          He said his office Internet was down, he didn't say his *home* Internet was down. It's not like the whole internet is just one small black box with a little red light on the top or anything...

          1. Stumpy

            Re: Was given the day off..

            Are you sure about that?

            I saw this documentary about it once: https://youtu.be/iDbyYGrswtg

  8. Anonymous Coward

    We're back up. 11 hours and 29 minutes of downtime :-(

    1. Tilda Rice

      Some of it is down again, certainly for BT. Something about a new distribution board is required something something

    2. Rabbit80

      Ours has just come back up.. 18 hours of downtime!

    3. TXITMAN

      Gotta mouse in your pocket AC? Where, what, when?

  9. F2H

    We're still down. Appalling lack of updates from Equinix.

    1. OldCrusty

      Appalling lack of updates from anyone.

      My AAISP FTTC has been down since 4:23am.

      1. Martin-73 Silver badge

        Ouch - along with their LNS outage last night, AAISP has taken a beating today! But they're on the ball as always.

  10. OldCrusty
    Joke

    TITSUP

    Titans Interrupted Telemetry Sockets Unfitfor Purpose

  11. Anonymous Coward

    UPS failure often sets off the VESDA system as the current flowing through the various components is very high. A little wisp of smoke is all it takes. If you remember the Bluesquare issue at Maidenhead a few years ago, one UPS output capacitor blew and took the row of UPS devices out, but the small amount of smoke set off the fire alarm system, which automatically cut the power to the building, and they weren't permitted to reinstate it until they had the OK from the fire brigade. The fire brigade were convinced there was a fire - they had the piece of paper to prove it - so had to have a good look around.

    I should say that Bluesquare, and more recently Pulsant, have completely refurbished the UPS systems at Maidenhead.

    1. sitta_europea Silver badge

      "... Should say that Bluesquare and more recently Pulsant have completely refurbished the UPS systems at Maidenhead."

      So no let-up in the spam then.

  12. AxelF

    Still down

    We're a customer of Gamma and are still down at 19:26.

    15 hours so far although our backup link remains up.

  13. Anonymous Coward
    Facepalm

    So it seems like 150 companies who were in the DC got affected....

    But those 150 companies include most of the largest ISPs and telecoms carriers in the UK.

    So it's kind of like saying "The victim was stabbed once, but it was right in the heart."

  14. SIP My Drink

    Working For An MSP....

    Is like attacking a bushfire with a chocolate hose...

    I was in the thick of this today. We go through numerous ISPs to supply service.

    V1 came back up at about 16.30 and services were flowing.

    Expo came back much later as they have a lot more kit across 3 floors in LD8 / HEX. They had to replace 2 switches because they were buggered.

    Virgin had 5 routers go pop - they will have to be salvaged or, if no good, replaced.

    Equinix confirmed at 21:55 that they had resolved the issue. But VM are still hard down, with some clients of theirs being unhappy. Equinix have closed it as they have done their part...

    Sterling Job, Peeps - Now Off Down The Pub...

    1. Anonymous Coward

      Re: Working For An MSP....

      Ouch, very ouch. I ran a bunch of racks in a colo very close by, many moons ago now. They had a big power failure and, basically, blew up a load of providers' kit. Luckily for us, our kit was all attached via remote power/reboot switches - a few of those went pop but all our main kit was fine. A rapid dash to site and a re-patch of power cables and we were up and running again within three to four hours, as soon as the power was back, if I recall correctly. Other people there were having a *very* bad day indeed. I felt very relieved and lucky, and really sorry for some of them; it was definitely not a moment to be righteous or smug at their expense. The engineers yesterday must really have felt it as well if their kit detonated. :-(

  15. OldCrusty

    Good News?

    August 18, 2020 at 16:33

    Additional to last:

    Equinix have advised that electrical work is being carried out at the data centre whereby services (under floor sockets) require migration to a new distribution board as one has failed. There are 8 floors that require this work to be carried out. Floors one and two have been completed. As a result we still consider the services to be at risk, despite them being restored currently. Further outages may be seen until the expected completion time of 21:00hrs tonight.

  16. deevee

    Seems UPSes cause more downtime than actual mains power failures do.

    Even large global datacentre providers that promote rock-solid power - with multiple UPSes, dual supply feeds, automatic static switching and generator backup - are always suffering power outages in their data halls.

    More complexity causes more outages, not less.

    1. TXITMAN

      EPO

      EPO systems are the leading cause of these outages. EPOs are required by code in most places and shut down the power to everything. Often the EPO systems are complex, uniquely designed, and not maintained.

      FYI: I never saw an Emergency Power Off switch save anyone's life, although it must have happened somewhere.

      1. swm

        Re: EPO

        At Dartmouth the computer center had a big red emergency power off button. Once, the computer room filled up with fog because the A/C was misadjusted, and the operator hit the "big red button", probably saving some mainframes and peripherals. I think the button was pressed a second time for another very good reason.

        No electronics suffered as a result.

    2. Anonymous Coward

      Hospitals

      Except in hospitals, where diesel generator tests cause more momentary power outages than UPSes do - and that's where UPSes are very valuable.

      1. Martin an gof Silver badge

        Re: Hospitals

        I suppose it depends what kind of failure you are trying to protect against. From a personal perspective at both home and work we get far, far more "short" (under a second) mains outages or surges than lengthy power losses. These are blips that would - without a UPS - cause the connected loads to reboot, with "downtime" measured according to how long it takes for the servers to come up again. With a UPS I get a flurry of warning emails*, but everything carries on as normal. A UPS is very beneficial for my use-cases.

        As you say, you will get the same thing with the momentary blips caused by failover tests and similar. The difference in a hospital, of course, is that a large amount of critical kit (e.g. bedside kit) has its own internal batteries, unlike a data centre, which will typically have a massive UPS covering multiple devices. This kind of hospital kit tends to be tested very regularly too, and by "distributing" the UPS, a unit failure has very local consequences.

        M.

        *once I had realised that the people who installed the system at work had put the actual computers on the UPSes, but not the network switches. I mean, what? (Oh, and these were Cisco 2950s, which take about a minute from power-on to come out of the STP "learning" phase and actually pass packets.)

        1. Anonymous Coward

          Re: Hospitals

          Yes, it's about understanding what a UPS is for! It's NOT for running your DC off while you wait for the power to come back on! It's there to smooth over the few-second outages, and to provide power long enough for stuff to shut down cleanly if you get an outage that's going to last anything more than a few tens of minutes (anything longer and your DC is going to get fecking hot, as most of the time the CRACs aren't fed from the UPS, so you lose cooling anyway). Or to see you through the time it takes for your generator to kick in.

          Worst outage I had was due to work being done on the supply to the building, which should have caused no more than 20 mins of downtime. I suggested we should have a generator on site but was told it would cost too much and "you have a DC UPS don't you?". Anyway, long story short: a total fuckup which resulted in a 1h 30min outage that fecked a lot of our systems. I did get to say I told you so ;o)
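
          To illustrate the "power long enough for stuff to shut down" part, here's a minimal watchdog sketch: it polls Network UPS Tools via upsc and starts a clean shutdown once the kit has been on battery too long. The UPS name, thresholds and shutdown command are assumptions for illustration only - in practice you'd use NUT's own upsmon or apcupsd rather than rolling your own.

          ```python
          #!/usr/bin/env python3
          """Minimal UPS watchdog sketch: shut down cleanly before the batteries run flat."""
          import subprocess
          import time

          UPS_NAME = "myups@localhost"   # hypothetical NUT UPS identifier
          MAX_ON_BATTERY_SECS = 120      # tolerate short blips; shut down after this long on battery
          POLL_INTERVAL_SECS = 10

          def ups_status(ups: str) -> str:
              """Return the ups.status value from upsc, e.g. 'OL' (on line) or 'OB DISCHRG' (on battery)."""
              result = subprocess.run(["upsc", ups, "ups.status"],
                                      capture_output=True, text=True, check=True)
              return result.stdout.strip()

          def main() -> None:
              on_battery_since = None
              while True:
                  status = ups_status(UPS_NAME)
                  if "OB" in status:                 # mains lost, running on battery
                      if on_battery_since is None:
                          on_battery_since = time.monotonic()
                      elif time.monotonic() - on_battery_since > MAX_ON_BATTERY_SECS:
                          print("Mains still out - starting a clean shutdown while there's battery left.")
                          subprocess.run(["shutdown", "-h", "+1"], check=False)
                          return
                  else:                              # back on mains, reset the timer
                      on_battery_since = None
                  time.sleep(POLL_INTERVAL_SECS)

          if __name__ == "__main__":
              main()
          ```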

  17. Anonymous Coward

    Talk talk

    As usual, 100% uptime with talk-talk!

    1. tiggity Silver badge

      Re: Talk talk

      TalkTalk Business was down for lots of customers.

  18. Ken Moorhouse Silver badge

    The Reg has to be commended for giving some wonderfully descriptive updates

    I wandered lonely as a cloud

    That floats on high o'er vales and hills,

    You mean this kind of descriptive?

    1. sitta_europea Silver badge

      Re: The Reg has to be commended for giving some wonderfully descriptive updates

      "... You mean this kind of descriptive?"

      Yeah, tell us!

  19. Stu

    Outage

    We're with Exponential-E; our service didn't resume until the early hours of the following morning - roughly a 22-hour outage!

    Nice to be high priority!

  20. Anonymous Coward

    Hit us

    Hit our main Virgin line but not our backup BT line. So Virgin line was down all day but we managed to switch over to the backup line after a fight with the firewalls. The Virgin line didn't come back until about 10pm.

    I was stuck in the server room most of the day helping. Thankful for new tech, though: the tablet was playing episodes of Midnight Caller and Columbo in the background, which I could listen to via a Bluetooth headset to drown out the noise of the fans and aircon.

    Who says "The Cloud" never fails?

  21. Anonymous South African Coward Bronze badge
    Coat

    Damned if you do, damned if you don't.

    So... whether you host it on-prem or "in the cloud", you're stuck if this happens to a core router and you cannot access your data...

    Ah, the joys of IT.

    icon --> getting ready to get out of IT, had more than enough stress, BillyWindows brownstuff and just stuffups in general.

    1. Anonymous Coward

      I'm the same, really wanting to bail out of IT. If it wasn't for the wife getting made redundant due to the covid sh1t show, I would have gone already.

  22. FBee

    uninterrupted power supply??

    I'd say it WAS interrupted! Uninterruptible Power Supply STAT

  23. sketharaman

    OMG, this is close - next door to my ex-house in Meridian Place E14 9FF!

    1. Jellied Eel Silver badge

      See? Even the postcode is HEX!

      But you could have been living even closer. Due to the march of progress, or at least property developers, at one point it was rumoured that the DC would be closed and redeveloped as luxury apartments. Which would have made life interesting given the concentration of IT, telecoms and the Internet around Docklands/Isle of Dogs.

  24. Delta Oscar

    Eggs... basket??

    Am I misreading what happened? I would hate to think what would happen if there was a really determined physical attack on these installations.
