OVHcloud datacenter fire last year possibly due to water leak

Late last month, France's BEA-RI, or Bureau of Investigation and Analysis on industrial risks, issued its technical report on the March 10th, 2021 fire at the OVH datacenter in Strasbourg. The French report [PDF] and summary [PDF] echo the findings of the Bas-Rhin fire service in March, 2022 that the lack of an automatic fire …

  1. Kevin McMurtrie Silver badge

    I hope Bond escaped in time

    I thought it was only in movies where you could overload a power circuit, set off violent fires and explosions everywhere, and burn the facility to the ground.

    1. Yet Another Anonymous coward Silver badge

      Re: I hope Bond escaped in time

      In movies you can do it with a virus on the mainframe.

      In real life you need to drop a USB key in the parking lot of your secret nuclear weapons lab

      1. Anonymous Coward

        Re: I hope Bond escaped in time

        As someone who works in power electronics, I can tell you that Bond would not necessarily have needed to be in the building or even the country to do this.

        Inverters are usually a bank of boards, each with its own microcontroller or FPGA driving IGBT switch outputs, where each group of two switches (a complementary pair or 'totem-pole') spans the high-voltage input or output rails. Sometimes they have a dedicated totem-pole gate driver which ensures that the high-side and low-side IGBTs cannot be energised at the same time (which would short-circuit the supply and cause the transistors to explode violently).

        However, that's not always true. Quite often the microcontroller will have its own complementary-PWM output that is used to drive transistor bridges. And of course, it is possible to change the pin function in software on these chips, such that they are no longer in complementary-PWM mode, but in general-purpose output mode.

        Even if they use dedicated totem-pole gate drives, it is still eminently possible for a bad firmware to deliberately cause fire, by overheating the inductors, shifting the AC phase angle, etc.
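
        A minimal sketch of the interlock idea described above, in Python rather than real firmware. The names (`GateCommand`, `interlock`, `complementary_pwm`) are hypothetical, and dead time between switch transitions is ignored for brevity. It only illustrates why a hardware gate driver matters once software can drive the two pins independently.

```python
from dataclasses import dataclass


@dataclass
class GateCommand:
    high: bool  # requested state of the high-side IGBT
    low: bool   # requested state of the low-side IGBT


def interlock(cmd: GateCommand) -> GateCommand:
    """Hypothetical gate-driver interlock: never let both switches conduct."""
    if cmd.high and cmd.low:
        # Shoot-through request: force both switches off instead.
        return GateCommand(high=False, low=False)
    return cmd


def complementary_pwm(duty: float, steps: int = 10) -> list[GateCommand]:
    """One PWM period in complementary mode: low is always the inverse of high."""
    on_steps = round(duty * steps)
    return [GateCommand(high=i < on_steps, low=i >= on_steps) for i in range(steps)]


def shoot_through(cmds: list[GateCommand]) -> bool:
    """True if any step would short the rails through both switches."""
    return any(c.high and c.low for c in cmds)


normal = complementary_pwm(0.4)
# Firmware that switches the pins to general-purpose output can request anything.
rogue = [GateCommand(high=True, low=True)] * 10
protected = [interlock(c) for c in rogue]
```

        In this toy model `normal` and `protected` never close both switches at once, while `rogue` does on every step.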

        And of course, almost every manufacturer these days uses OTA updates. Even for the low-level controllers on the power boards. Of course the firmware is signed, and the bootloader would not accept a firmware image without a valid signature, and to craft a deliberately-dangerous firmware and then deploy it into an active installation would require insider knowledge. But that shouldn't be too much trouble for Bond. (or his Russian/Chinese counterpart..)
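
        The bootloader check can be sketched roughly as follows. This is an illustration only: HMAC-SHA256 from the Python standard library stands in for a real asymmetric signature, and `BOOT_KEY`, `sign` and `try_update` are made-up names, not any vendor's API. The point is that an invalid image is rejected and the running image is kept, while anyone holding the signing key can still get a dangerous image accepted.

```python
import hashlib
import hmac

# Hypothetical key baked into the bootloader; a real design would hold only a public key.
BOOT_KEY = b"example-key-not-for-real-use"


def sign(image: bytes, key: bytes = BOOT_KEY) -> bytes:
    """Produce the tag shipped alongside a firmware image (stand-in for a signature)."""
    return hmac.new(key, image, hashlib.sha256).digest()


def try_update(current: bytes, candidate: bytes, tag: bytes) -> bytes:
    """Return the image that should run after the update attempt."""
    expected = sign(candidate)
    if hmac.compare_digest(expected, tag):
        return candidate
    # Reject the update and keep running the known-good image.
    return current
```

        For example, `try_update(b"v1", b"v2", sign(b"v2"))` returns the new image, while a wrong tag leaves `b"v1"` in place.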

        He may even have noticed in advance, that the building was designed like a funeral pyre, not a datacentre, and decided that a small spark from an inverter was all that he needed.

        (Seriously, why is the architect not being hung out to dry for his Towering Inferno? I notice in the French report that they attribute the ferocity of the fire to the building's "natural convection" design, but it is not featured in the "lessons learned" section ...)

        And that there was no means whatsoever to shut down power to the building again seems like design incompetence straight out of Towering Inferno.. (classic film, I'm sure everyone on here has seen it)

        1. Yet Another Anonymous coward Silver badge

          Re: I hope Bond escaped in time

          >Of course the firmware is signed

          And if the payload is invalid it will of course just erase its current settings and then shut down because it has no valid configuration - or is that only the engines of military transport aircraft ?

        2. NeilPost

          Re: I hope Bond escaped in time

          A handful of large gravel and few cans of expanding foam filler in the outside heat exchanger fan units will screw it permanently for you. Nothing sophisticated needed

    2. Anonymous Coward

      Re: I hope Bond escaped in time

      I'm fingering Irwin Allen.

      The sparks are a dead giveaway.

  2. A Non e-mouse Silver badge
    Pint

    High current battery systems (UPS, electric cars) scare the crap out of me on a good day. Seeing the same (still live!) on fire... *shudder* All respect to the firefighters who had to deal with that mess.

    Beer icon, because....

  3. Pascal Monett Silver badge

    Interesting article marred by dreadful time formatting

    "The fire was subdued around 1000 after the electrical network was cut off and a pump boat arrived"

    Around 1000 what ? Seconds ? Minutes ? Or is it supposed to be 10:00 A.M. ?

    Military timing is all good, but a : every now and then wouldn't hurt.

    1. Yet Another Anonymous coward Silver badge

      Re: Interesting article marred by dreadful time formatting

      It's French metric time

      1. David 132 Silver badge
        Happy

        Re: Interesting article marred by dreadful time formatting

        “À la récherche du côlon perdu”?

        1. Anonymous Coward

          Re: Interesting article marred by dreadful time formatting

          Wrong sort of colon, ITYM deux-points perdu...

          1. David 132 Silver badge
            Unhappy

            Re: Interesting article marred by dreadful time formatting

            Blech. I hang ma tête in shame. I couldn’t think of the word and was misled by the (frankly, terrible) google translate interface, which suggested “côlon” and then helpfully defined the word underneath as both “part of the intestine” and “a punctuation mark”.

            In short: I am an idiot for relying on anything google-related. If they can’t even get a translator UI correct, do we really think they have a sentient AI as per the hysterical claims in the media over the last few days?

            1. Yet Another Anonymous coward Silver badge

              Re: Interesting article marred by dreadful time formatting

              >If they can’t even get a translator UI correct, do we really think they have a sentient AI

              Yes but of course it would just speak English and do ALL-CAPS if forced to talk to a foreigner

      2. Anonymous Coward Silver badge
        Boffin

        Re: Interesting article marred by dreadful time formatting

        0800z in CEST is from hereon to be known as 1 kilominute.

    2. rcxb Silver badge

      Re: Interesting article marred by dreadful time formatting

      >Military timing is all good, but a : every now and then wouldn't hurt.

      $ rsync -a "10:00" remote:/tmp/

      The source and destination cannot both be remote.

      rsync error: syntax or usage error (code 1) at main.c(1275) [Receiver=3.1.2]
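
      The error arises because rsync reads an argument with a colon before the first slash as host:path. A common workaround is to prefix such a relative local path with `./`. A small sketch of that rule follows; `rsync_local_arg` is a hypothetical helper, not part of rsync, and it assumes POSIX-style paths.

```python
def rsync_local_arg(path: str) -> str:
    """Prefix a relative path with ./ if rsync would read it as host:path."""
    # Only the part before the first slash can be mistaken for a host name.
    first_component = path.split("/", 1)[0]
    if ":" in first_component:
        return "./" + path
    return path
```

      With this, `rsync_local_arg("10:00")` gives `./10:00`, which rsync treats as a local file.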

  4. loles

    Ironic

    I find it ironic that the possible causes of the fire are points that OVH has touted as innovation in the past (water cooling and data center design). When OVH was still talking, they mentioned that this incident could serve as a lesson to the industry and that they hoped to establish protocols to avoid this type of disaster in the future. The fact is that industry standards already exist, like the ANSI/TIA-942-A Infrastructure Standard for Data Centers, ANSI/BICSI 002-2014 Data Center Design and Implementation Best Practices, and the Uptime Institute's Tiers, to name a few. Maybe OVHcloud should look into some of these standards before trying to re-invent the wheel.

    OVHcloud has been in the game for 20-plus years. Rather than being an industry leader, they are playing catch-up. This should serve as an example as to why it pays to do things correctly the first time and not cut corners. OVH has a long history of learning as they go. Each time they have had a major incident in the past, they say they are sorry and that they have learned from the experience.

    1. Anonymous Coward

      Re: Ironic

      Maybe they want a standard that is cheap and easy to implement. So they don't bother to look at other official standards.

      They want to do it as cheaply as possible, otherwise it impacts their business model. There are other cloud providers, and some customers just look at cost. Some businesses just look at cost as well. The smart ones that used OVH are the ones that left. They were stuck with lost data and very limited to no recovery options given the issues that OVH had. Their other regions were impacted as well.

      1. Anonymous Coward

        Re: Ironic

        "Maybe they want a standard that is cheap and easy to implement. So they don't bother to look at other official standards."

        They want something, standard or not, that costs nothing. And they've implemented it already.

        Pay sweet fuck all, no automatic suppression system, no fire containment measures (inter levels for example), nothing, just 2 sods with a portable fire extinguisher.

        And once things inevitably go AWOL, tell the world you're gonna learn. It calms investors down.

        Works wonders, so far, until someone is (hopefully not) killed.

        1. loles

          Re: Ironic

          So, OVHcloud is more like the Lidl or Ryanair of cloud providers. But in their mind, they are a premium provider and an alternative to US and Chinese providers. Perhaps OVHcloud needs to take a look at what they are and not what they dream to be.

          OVHcloud is like a 3rd world country gaining a position in the World Cup. They play the game but cannot compete

        2. David 132 Silver badge
          Thumb Up

          Re: Ironic

          >tell the world you're gonna learn

          No, no, no. Never “we’re going to learn”. That sounds dangerously like a commitment, and sets up actual obligations.

          Always use the passive voice. “Lessons will be learned”. Sounds good without actually committing anyone to anything.

          Corporate Communications 101!

      2. Anonymous Coward Silver badge
        Holmes

        Re: Ironic

        > "The smart ones that used OVH are the ones that left. They were stuck with lost data and very limited to no recovery options"

        I disagree. The smart ones had backups and BC plans in place that didn't involve the same provider as their primary system.

        1. Anonymous Coward

          Re: Ironic

          We used to be OVH clients some time before the blaze (we left due to other reasons, mainly poor availability and high cost).

          As for the backup strategy, I fought inside my company (small, 10- employees) with both bosses and coworkers for a backup strategy that, at least, had an additional backup storage outside of the main datacenter where we host everything.

          Bosses were reluctant due to costs. Coworkers (some, not all) were reluctant because, as I work both as sysadmin and software developer, this extra work would reduce my capacity for development and thus put more pressure on them. I was literally told I was losing too much time on it, by the very coworker whose files I had to restore from these same backups because he messed up HARD (rm -rf two directories higher than he should have been).

          Although I managed to complete the new backup policy way before the OVH blaze, I held the blaze up as an example of why it was needed, because from time to time they still got at me for my insistent pestering. Not anymore.

          AC just in case.

        2. Nate Amsden

          Re: Ironic

          There are no smart ones here. The smart ones would never have been a customer of OVH to begin with. The only reason I can see to use a provider like OVH is because you really, really don't care about just about anything (other than perhaps cost).

          The big IaaS clouds are really not much better though. They too design for facility failure and expect the customer to account for that (as we have seen, many customers do not account for that, or if they do, they do a poor job of it). A lot of people still believe that big names like Amazon and Microsoft have super redundancy built into their stuff. They of course do not, because that costs $$$; they would rather shift that cost onto customers.

          Meanwhile in my ~20 years of using co-location I have witnessed one facility failure(power outage due to poor maintenance), and we moved out of that facility shortly after(company was hosted there before I started in 2006). That facility suffered a fire about 2-3 years later. Customers had plenty of warnings (3+ power failures in prior years to the fire) to leave. There are a TON of facilities I'd never host critical stuff in(probably 60-75% of them), even the facility I use to host my own personal gear in the Bay Area (which has had more power outages than my apartment over the past 7 years but for my personal stuff given the cost it's not a huge deal).

          My favorite facility at the moment is QTS Metro in Atlanta(look up the specs on the facility it's just insane the scale). Been there over 10 years not a single technical issue(not even a small blip), and the staff is, I don't have words for how great the staff is there. Maybe partially an artifact of being "in the south" and perhaps more friendly but they are just amazing. Outstanding data center design as well, 400-500k+ of raised floor in the facility, N+1 on everything, and nice and clean. I put our gear in there while it was still somewhat under construction.

          By contrast my most hated facility was Telecity AMS5 in Amsterdam (now owned by Equinix). I hated the facility so much(had to put "booties" over your shoes before walking on the raised floor WTF), and I hated the staff even more(endless stupid, pointless policies that I've never seen anywhere else). Fortunately we moved out of that place years ago (before Equinix acquired them).

    2. Roland6 Silver badge

      Re: Ironic

      Not sure if water cooling was the primary suspect. Given what we know about the ventilation design, an increase in humidity could have its origin in a change of weather.

  5. Roland6 Silver badge

    No video fire/smoke detection then?

    It would seem that whilst OVH had video surveillance, they hadn't partnered it with fire/smoke detection software - I wonder whether things might have turned out differently if the fire had been drawn to operators' attention earlier.

    What is also not clear (not fully translated the report) is whether the standard procedure was to simply call the fire brigade and evacuate, rather than investigate.

    1. Yet Another Anonymous coward Silver badge

      Re: No video fire/smoke detection then?

      > standard procedure was to simply call the fire brigade and evacuate, rather than investigate.

      Email the fire brigade, put in a request for extra "being on fire allowance", go on strike

    2. entfe001

      Re: No video fire/smoke detection then?

      The original French report notes that the detection and the alert to the fire brigade were suitably swift. The staff received the alarm, promptly alerted the emergency services after confirming it was real, and left for their own safety. While there were only three people on site, following proper procedure meant nobody had to search for potentially missing people, which protected the staff and spared the emergency services a search and rescue operation. They also assisted the fire brigade wherever they were able to.

      The main problems were the electrical isolation of the site, which could not be properly achieved until two hours had passed, by which time SBG2 was already a total loss, and the lack of a proper water supply, either from the nonexistent internal fire suppression system or from the on-premises emergency water supply, which was under-performing. The fire could not be controlled until a large water barge, which happened to be in the port of Strasbourg at the time, arrived at 3am. If not for that barge, the whole site would have been lost to the fire.

      For the electrical isolation, they had to make sure all redundant systems were either turned off or depleted, and that's what caused the massive delay before intervention. First, the external power supply had to be cut off, which was not possible on site because the breakers were too close to the fire and thus not safe to operate, so the electric company had to be called to cut off supply to the area. Second, the emergency diesel generators had to be prevented from starting at all costs, overriding their default setup. Third, as the batteries could be neither removed nor isolated, they had to wait for them to deplete, and there was no accurate estimate of how long that would take. All this waiting allowed the fire to develop.

      This proved that the site was, after all, quite well designed against power loss. The problem was, precisely, that a total power loss was actually needed for the fire brigade to act.
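
      The isolation logic above can be sketched in a few lines. This is a toy model, not anything from the report: `Source`, `safe_to_intervene` and `depletion_hours` are hypothetical names, the figures in the usage note are invented, and the depletion estimate assumes a constant load, which is exactly the assumption that is hard to justify during a fire.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    energised: bool


def safe_to_intervene(sources: list[Source]) -> bool:
    """The site counts as isolated only when every source is off."""
    return not any(s.energised for s in sources)


def depletion_hours(stored_kwh: float, load_kw: float) -> float:
    """Rough time for a battery bank to drain, assuming a constant load."""
    if load_kw <= 0:
        raise ValueError("load must be positive")
    return stored_kwh / load_kw


# Invented state: grid cut and generators inhibited, batteries still live.
site = [
    Source("grid", energised=False),
    Source("diesel generators", energised=False),
    Source("batteries", energised=True),
]
```

      Here `safe_to_intervene(site)` stays false until the last source is dead, and `depletion_hours(500, 250)` would suggest two hours for an invented 500 kWh bank at 250 kW. Every layer of redundancy is one more entry that has to reach zero.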
