I still find it mildly horrifying that a single configuration change by one engineer can bring down a global network as vast as OVH's. You'd think there would be some safeguarding in place. Alas, not.
OVH blames hour-long global outage on human error during 'routine' network reconfiguration
Hosting provider OVH reckons that its servers are "gradually returning" following a worldwide outage that lasted for around an hour. The French co-location biz's services went AWOL from 7:20am UTC. The company had earlier talked of planned "maintenance on our routers" in its Vint Hill, Washington DC, data centre to "improve …
COMMENTS
-
Wednesday 13th October 2021 11:35 GMT Peter-Waterman1
From what I understand, Amazon has no dependencies between regions, so an incident in one region is unable to affect another. E.g., each region has a console service to log on to, etc. The worst outage I have seen was the Amazon S3 outage four years ago, but it still only affected one region.
Different from Azure, where South Central AD took out global services, and different from OVH, who seem to be in all sorts of hot water when it comes to running public cloud.
While I am loath to pay Jeff Bezos any more cash, they do seem to be able to run a stable-ish cloud.
-
Wednesday 13th October 2021 11:43 GMT Peter-Waterman1
Looks like a global dependency on Azure AD may now be fixed...
Resiliency: Microsoft has made concerted efforts to improve resiliency with critical services such as Azure Active Directory, but many Gartner clients remain concerned about the real-world impacts when such critical services are unavailable. Further, Microsoft continues to react slowly to the rollout of AZs with the likelihood that some regions will never be equipped with such resiliency capabilities. Services such as the Azure Kubernetes Service (AKS) continue to experience some outages, particularly in association with updates and maintenance events.
-
Wednesday 13th October 2021 17:18 GMT Xalran
Consider this:
Plugging two unconfigured switches (that are interconnected) into two backbone switches can bring down the whole backbone for hours. Do that on a $TELCO backbone and you basically have no phone until they unplug at least one switch from the backbone.
Yes, it occurred (more than once); no, I wasn't involved in it... (I jumped ship to another branch a few months before it occurred... and yes, it would make a great Who, Me?... but it's not old enough yet.)
Yes, it will occur again.
-