Electrical box fault blamed for GS2 data centre outage

A power outage at “Europe’s largest purpose-built data centre”, Global Switch 2 – which knocked one customer offline for two days – has been blamed on a high voltage fault. All customers lost access to services based in the GS2 data centre on Saturday 10 September, according to an interim incident report issued by hosting …

  1. Hans Neeson-Bumpsadese Silver badge

    (un)interruptible power supply

    on-site diesel rotary uninterruptible power supply devices

    Sounds like the whole place is run by a bunch of Wankels.

    1. Anonymous Coward

      Re: (un)interruptible power supply

      On the roof they have 12 or 13 DRUPS* if I recall correctly. I have no idea how they feed the various floors, but when they last replaced them all, 7 or 8 years ago, one went titsup at the wrong time and 3 floors had sections that lost power completely (or was it that the entire floors lost power? can't remember).

      I was talking to a customer while installing equipment in there in the last couple of weeks and he mentioned that they have had 2 outages (total outages, both PDUs down) in the last 3 months, and no downtime at all in the 5 years before that. The big problem they had was that their hypervisors went down in a complete heap and of course all the VMs were in the shit as well.

      *someone told me a few years ago that they use Euro Diesel DRUPS, while we were waiting for GS2's nazi security to allow me in despite my surname being obviously misspelled: http://www.euro-diesel.com/english/system-description/110/2

  2. The Original Steve

    I don't understand

    Do Claranet and their clients not have UPSes in the racks? A few ms shouldn't take services down.

    1. Blotto Silver badge

      Re: I don't understand

      Why buy an individual rack UPS when the hosting provider guarantees the whole building is UPS'd?

      That's why you host with a hosting provider: they can afford the expense of a whole-building UPS, with generators to back that up.

      Problems happen; it's how you deal with the problems that matters, regardless of whether you have your own UPS/Genies/DCs or not.

      1. frank ly

        @Blotto Re: I don't understand

        It's no use having a Genie if the lamp isn't maintained and the personnel don't know how to rub it properly. Also, don't even think about using a Wizard.

    2. Pangasinan

      Re: I don't understand

      Each box should have dual PSU modules fed from different AC power sources.
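
      As a rough illustration of that check (the host names and feed labels below are hypothetical, purely for illustration), a minimal Python sketch that flags any box whose two PSUs end up on the same AC feed might look like this:

      # Hypothetical inventory: box -> (AC feed of PSU 1, AC feed of PSU 2).
      # All names are made up for illustration.
      PSU_FEEDS = {
          "db01":  ("feed-A", "feed-B"),
          "web01": ("feed-A", "feed-A"),   # mis-cabled: both PSUs on the same feed
          "app01": ("feed-B", "feed-A"),
      }

      def single_feed_boxes(inventory):
          """Return boxes whose 'redundant' PSUs actually share one AC source."""
          return [box for box, (psu1, psu2) in inventory.items() if psu1 == psu2]

      if __name__ == "__main__":
          for box in single_feed_boxes(PSU_FEEDS):
              print(f"{box}: both PSUs on the same feed - no real redundancy")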

    3. Anonymous Coward

      Re: I don't understand

      When we put our kit into the co-lo DC we're in now (only a year or so ago), the Centre Manager told us that he thinks we're the only tenant with UPSes in our own racks...

      So - why bother? 'Just in case'.

      How deep do you want your redundancy to go? I guess that answers the question for Claranet.

      1. Lee D Silver badge

        Re: I don't understand

        What's the point of a UPS in a cabinet that sits in a datacentre with no power?

        Sure, it's a "gentler" shutdown in the event of a prolonged outage, but if their switches, routers, etc. are offline, so is your server even if it's still spinning.

        A UPS usually takes up another couple of U, and that costs money in a co-lo.

        And they need battery replacements on a regular basis.

        You might have ridden out the power cut, but if you're offline for 5 minutes because everything else is rebooting, or offline for 2 days in order to sort the mess out, that UPS is basically worthless.

        That's why most people don't bother.

        Next question - why isn't the datacentre rack UPS'd for you, which would have made the outage a literal 222ms blip for everyone?

        And, I'll tell you now, UPSes do not protect against everything. I have seen a UPS shut down hard with error conditions if it thinks there's something wrong with the source power. Don't believe me? Try putting an APC UPS on a system with a crossed phase. Sure, the fuse trips, the power goes off, but the UPS also goes "WOAH!" and just shuts down hard, beeping like mad.

        In my case, it was a caterer plugging in hotplates with lamps and managing to put them on two different phases by using two far-apart sockets (still shouldn't happen, but it did in this case). The UPS at the other end of the site, on one of those phases, gave up immediately. And it did so repeatedly, four or five times, while we were trying to find the cause.

        The second the phase problems were fixed, the UPS worked as expected and held a full load for the expected time. But the simplest of wiring faults, or problems with the source power, such as you might get with large datacentres, generators and suchlike, can knock even the most expensive UPS for six anyway. They are required to put safety before your servers. Same goes for "fireman's switches".

        1. Captain Scarlet Silver badge

          Re: I don't understand

          People and companies pay a lot to co-locate in London, and as UPSes are not 100% efficient, customers will also end up paying for the extra power consumption.

      2. Blotto Silver badge

        Re: I don't understand

        Are your ISP's routers in your rack or in the ISP's rack, and do they have their own UPS?

        1. Anonymous Coward

          UPS in the rack

          In the long run, that almost certainly exposes you to greater failure risks. Though I suppose if you're clustered, and the clustered systems are in different racks on different UPSes, you mitigate that issue.

          Still, like another poster said, the reason you have a building-wide UPS is so you don't also need rack-level UPS. I've never seen servers with rack UPSes in a building protected by UPS. Not saying it doesn't happen (some posters here sound like they're doing it), but the issue here is that the building-level UPS wasn't working - a 222 ms outage shouldn't affect a building full of servers that should be running off battery 24x7 - assuming it is configured as an "online UPS" rather than relying on something to cut over to battery when the utility power fails. Something which obviously failed in this case.
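
          For a sense of scale - a back-of-envelope sketch in Python, using typical assumed figures rather than anything from GS2 or the incident report - a bare server PSU's hold-up time is far shorter than a 222 ms dip, while either a clean transfer to battery or a double-conversion (online) UPS would mask it entirely:

          # Back-of-envelope comparison; all figures are typical assumed values,
          # not measurements from GS2 or the incident report.
          DIP_MS             = 222  # reported blip on the affected feed
          PSU_HOLDUP_MS      = 16   # assumed server PSU hold-up time at load
          TRANSFER_MS        = 8    # assumed standby/line-interactive UPS transfer time
          ONLINE_TRANSFER_MS = 0    # online (double-conversion) UPS: inverter already carries the load

          def rides_through(gap_ms, holdup_ms=PSU_HOLDUP_MS):
              """Can the server PSU bridge a gap of gap_ms on its own hold-up capacitance?"""
              return gap_ms <= holdup_ms

          print("Bare feed, 222 ms dip:        ", rides_through(DIP_MS))             # False
          print("Standby UPS (clean transfer): ", rides_through(TRANSFER_MS))        # True
          print("Online UPS (no transfer):     ", rides_through(ONLINE_TRANSFER_MS)) # True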

    4. Vince

      Re: I don't understand

      Some datacentres forbid you from installing your own UPS equipment. Yep, really.

      1. patrickstar

        Re: I don't understand

        Ever seen a UPS blow up? Then you'd be wary of having one anywhere near sensitive equipment.

        I'd personally strangle you if I saw you putting a box full of potential hydrogen/oxygen explosives, a.k.a. batteries, in the rack next to my gear. Even if they don't blow up, just the heat generated by hundreds of amps passing through a solid short across a decent battery pack is enough to cause a lot of trouble, both for your gear and your neighbours'.

        Also, small UPSes tend to have shorter MTBFs than the power in a proper facility. Hell, even shorter than the grid power in some places.

        1. Nate Amsden

          Re: I don't understand

          I haven't seen a UPS blow up myself, but I killed one of my UPSes about a month ago: I replaced the batteries and had them wired up wrong. The UPS committed suicide, I guess, to save my equipment. The breaker tripped, it started making clicking noises and I smelled some smoke. Some internal parts connecting the batteries were melted. This particular UPS had 9 batteries in it, a rack-mount double-conversion sine-wave UPS, not a tiny little thing with a single battery. I'm frustrated that I was too lazy to double-check the voltage before I hooked it back up again. I checked it the first time I replaced the batteries; then 2 of the batteries turned out to be bad and were replaced, and I was over-confident that I had wired it up right. That cost me about $700.

          Now that I think about it more, I suppose I did sort of have a UPS blow up about 12 years ago. There was a power failure, and one of the pieces of equipment connected to the UPS did not like the voltage it gave off in battery mode, so it reacted badly and the UPS sort of blew up: the chassis was warped and the batteries were leaking, but besides that there was no damage to anything.

          I've never seen a big UPS blow, though, nor have I heard any stories about such an event. I've only been to one data center in the past 16 years that suffered a power failure on both redundant circuits. Since that time I haven't taken power for granted. I have not had a single feed go down in about 7 years at any facility (which for me has been 4 different facilities in that time).

  3. Sir Barry

    At least

    they didn't claim "only a small number of customers were affected"

  4. Tridac

    Completely unacceptable that such a large data center should go down on a quarter-second interruption. Battery UPS as the first line, followed by generators with sump heaters that should be online in less than a minute. Of course, battery-based UPS for that sort of power costs serious money to buy and maintain, but there's no excuse for skimping on a major data center. A single HV feed as well. What were they thinking?

    Oh yes, what exactly are "on-site diesel rotary uninterruptible power supply devices"? Are they running 24x7, or what?

    1. Clockworkseer

      DRUPS: flywheel UPS. Always running. The idea is that when your main power cuts out, the flywheel keeps the power running and then the diesel generator kicks in to take over the load (a rough back-of-envelope of the ride-through is sketched below). This means they have to have loads of them, so they can afford to have a couple out of service at any given time for repairs (they take a fair bit more maintenance than battery systems, but seem to be preferred because of their higher current output).

      Wikipedia has a couple of good(ish) articles on the subject.
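
      As promised above, a rough back-of-envelope in Python (all figures assumed for illustration; the Euro Diesel units at GS2 will differ): the flywheel's stored energy is E = ½Iω², and at a load P it buys roughly E/P seconds before the diesel has to be carrying the load:

      import math

      # Illustrative flywheel only - not the spec of any real DRUPS at GS2.
      INERTIA_KG_M2 = 600.0   # assumed moment of inertia of the flywheel
      RPM           = 3000.0  # assumed spin speed
      LOAD_KW       = 1000.0  # assumed electrical load carried during the gap

      omega = 2 * math.pi * RPM / 60.0            # angular velocity in rad/s
      energy_j = 0.5 * INERTIA_KG_M2 * omega**2   # stored kinetic energy in joules

      # Crude upper bound: pretends all stored energy is extractable at constant load.
      ride_through_s = energy_j / (LOAD_KW * 1000.0)
      print(f"Stored energy: {energy_j / 1e6:.1f} MJ")
      print(f"Ride-through:  ~{ride_through_s:.0f} s at {LOAD_KW:.0f} kW (upper bound)")

      In practice only a fraction of that energy is usable before the output frequency droops, which is why the diesel is expected to pick up the load within seconds rather than minutes.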

  5. FurtherDownSouth

    That's the Wrong Wrench, Love...

    The cable terminations of the remaining available DRUPS have been checked...

    So basically, they forgot to tighten the screws...

  6. Anonymous Coward

    Strange, I'm not aware of any of the servers I look after in GS2 complaining on the day in question. I'll have to ask our own DC/facilities guys if they were aware of anything.

  7. Anonymous Coward

    This isn't a GS2 Issue

    I quite applaud GS2 for saying they're unhappy about a 222ms drop on one particular feed. Pretty much all suites/services in GS2 have feeds from different power sources, and anything important should have dual/triple/quad PSUs plugged into multiple feeds, to ensure an outage on one system doesn't tank your systems.

    Put bluntly, it's pointless turning up to a facility like GS2, with all that redundancy, and then plugging your servers in with ONE power cable...

    One of several machines I have access to in that building reports the following;

    "14:53:30 up 82 days"

    Not a blip on any of them, and that's a VMware platform running across several hosts/SAN systems; nothing dropped at all. If an entire power system failed, we'd get alerts, but everything would carry on fine unless both of the systems went offline, which would then be a different scenario. Losing one feed in a DC should be a business-as-usual scenario.

    If you've put a highly resilient system into a datacentre and it can break because the site has a 222ms blip on one power feed, then you've not designed it properly. Put bluntly, when designed correctly, you should expect to occasionally lose a supply because of maintenance/site/environmental/rack PDU issues, and should dual-home your power anyway.

    1. Anonymous Coward

      Re: This isn't a GS2 Issue

      Unfortunately you clearly don't know how GS2 is designed. You have to think of GS2 as really being 4 data centres, each with its own Tier 3+ designed power system, which comprises lots of DRUPS and other kit. This time power system H1 failed and took 3 floors with it. No floor, or suite within a floor, or rack within a suite is powered by more than one power system. If you didn't see an issue, that's because you're in the 3/4 of the building that's powered from a different power system.

  8. EveryTime

    Reliability is difficult.

    Adding a UPS or redundant power supply can easily reduce the reliability of a system.

    Two decades ago I was running clusters of cheap machines beside several high end machines. We were on building power, not on the big UPS. One of the clusters reached two years of kernel uptime on every machine but one. The other machines in the room never reached 3 months during that time. The big UPS blew up twice, once due to a wrench dropped during a routine inspection service. A big network switch had repeated problems with the power combining circuit for the redundant power supplies melting, while similar switches with a single power supply were reliable.

    That doesn't even cover the software problems. A machine with SECDED ECC memory would subtly corrupt memory when it tried to map out a page that had experienced a corrected fault, turning a non-problem into untraceable data corruption.

  9. nucotech

    Hi,

    You have to be careful about putting a UPS into a rack enclosure/data hall (i.e. don't do it). UPS rooms are separate from data halls for a reason. Tier IV recommends at least a two-hour fire rating between power infrastructure and data halls. UPS systems and batteries should be checked at least twice yearly, and terminal torques at least every few years.

    Batteries are a self-contained ignition source, which means that you can put a fire out and it will come back. Data halls tend to be protected by an inert gas, which, once discharged, has to be refilled. This means that if you had a battery fire in a data hall, the gas would discharge, and then if a battery reignited there would be nothing to stop it and you could lose all your data. This is also why gas suppression in a battery room has limited benefit without automatic link breakers (i.e. something to reduce the voltage in the string).

    Andrew

  10. swm

    It can be done correctly

    It can be done correctly. Paychex has six copies of its data center spinning in geographically distinct locations. Each has UPS and diesel generators. We had a utility power failure that lasted a good part of a week. I asked the Paychex representative how well their systems worked during this total loss of utility power. He said that everything worked perfectly. What did not work was their customers who had no adequate power plans. So Paychex had to rent trucks to visit their customers and hand-carry the data required to generate pay checks and deliver the completed pay checks back to their customers.

    They could have lost five of their data centers and still been in business. This is a company that understands where their crown jewels are and protects them correctly. They don't trust clouds or third-party suppliers for this (or at least didn't a few years ago).

  11. Anonymous Coward

    It's crap like this that makes me love public cloud: so simple and cost-effective to design a resilient solution with no SPOF. Less time fiddling with kit, more time looking after my customers.

  12. stevebp

    What exactly is "Tier 3 Enhanced" anyway?

    Or "Tier 3+" or anything else that mentions "Tier" that hasn't been independently certified? I know this site really well and I wouldn't call it "Tier 3+". Not with only one power source, albeit split into two pdus by the time it hits the cabinet. There are multiple redundant DRUPS (they operate at N+2 following a similar recent incident) in the H1 power station but that wouldn't prevent a fault such as this taking out the entire floors that H1 serve. GS decided that rewiring the floors to be served by more than one 'power station' was too expensive. I visit a lot of DCs and they all make claims such as GS do and most of them are fanciful at best. If you want to know precisely how 'available' a DC is and could be in the event of an incident, then go and speak to an independent organisation that can validate it for you - don't trust the marketing. If you're now baulking at the assumed cost, then perhaps your business isn't that critical in the first place and outages are an acceptable cost of outsourcing your services.
