OVH blames hour-long global outage on human error during 'routine' network reconfiguration

Hosting provider OVH reckons that its servers are "gradually returning" following a worldwide outage that lasted for around an hour. The French co-location biz's services went AWOL from 7:20am UTC. The company had earlier talked of planned "maintenance on our routers" in its Vint Hill, Washington DC, data centre to "improve …

  1. Adam JC

    I still find it mildly horrifying that a single configuration change by one engineer can bring down a global network as vast as OVH's. You'd think there would be some safeguarding in place; alas, not.

    1. seven of five

      Single configuration changes by one engineer have brought down networks vastly larger than OVH's (Micros~1, Amazon, IBM, ...). Some changes just cannot be isolated.

      1. Peter-Waterman1

        From what I understand, Amazon has no dependencies between regions, so an incident in one region is unable to affect another. E.g., each region has a console service to log on to, etc. The worst outage I have seen was the Amazon S3 outage four years ago, but it still only affected one region.

        Different from Azure, where South Central AD took out global services, and different from OVH who seem to be in all sorts of hot water when it comes to running public cloud.

        While I am loath to pay Jeff Bezos any more cash, they do seem to be able to run a stable-ish cloud.

        1. Peter-Waterman1

          Looks like a global dependency on Azure AD may now be fixed...

          Resiliency: Microsoft has made concerted efforts to improve resiliency with critical services such as Azure Active Directory, but many Gartner clients remain concerned about the real-world impacts when such critical services are unavailable. Further, Microsoft continues to react slowly to the rollout of AZs with the likelihood that some regions will never be equipped with such resiliency capabilities. Services such as the Azure Kubernetes Service (AKS) continue to experience some outages, particularly in association with updates and maintenance events.

    2. Xalran

      Consider this:

      Plugging two unconfigured switches (that are interconnected) into two backbone switches can bring down the whole backbone for hours. Do that on a $TELCO backbone and you basically have no phone until they unplug at least one switch from the backbone.

      Yes, it occurred (more than once); no, I wasn't involved in it... (I jumped ship to another branch a few months before it occurred... and yes, it would make a great Who, Me?... but it's not old enough yet.)

      Yes it will occur again.

  2. Dr Who

    All talk, no trousers

    Ms Thunberg might have a thing or two to say about this and the FB outage:

    Change control - blah, blah, blah

    No single point of failure - blah, blah, blah

    Systems engineering - blah, blah, blah

  3. Anonymous Coward

    "improve routing"

    I think I can speak for the rest of the internet when I say that OVH dropping offline improves routing for the rest of us. And vastly reduces the DDoS attempts we have to fend off.

  4. mark l 2

    Strangely, my website uptime monitor at UptimeRobot shows no downtime for my OVH VPS for the last 24 hours. So it obviously didn't affect all of their services.

    1. Anonymous Coward

      I can assure you, we have at least 1 VPS in every region OVH serves and the entire network was down. It was GLOBAL, despite what your monitor was saying.

      It was only a network outage; all the VMs, servers and VPSs remained powered on throughout, FWIW.

  5. h3nb45h3r

    Bloody intern....

