
It's not just Oracle
I'm at the brown end of this and can tell you that (at least) Microsoft Azure and SAP have also been impacted by this outage.
It has provisionally been attributed to a cooling failure caused by a power surge. (This may change.)
Oracle, Netsuite, and Microsoft's clouds have gone down, hard, in the Sydney, Australia, region likely due to an issue at a datacenter provider in which both are tenants. The Big Red Cloud first advised customers of an outage at 2129 Sydney time (1229 UTC) on Wednesday, and 29 minutes later wrote to inform customers that the …
> We should move our monitoring systems to a different DC
No, you should be running in more than one DC, or at least in more than one AZ in the region.
A different DC will have its own set of vulnerabilities to an outage. As long as you are in a single AZ you will go down with that AZ.
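A minimal sketch of that idea, assuming you have some per-zone health probe to call. The zone names and the `is_healthy` callback are hypothetical placeholders, not any provider's API:

```python
from typing import Callable, Iterable, Optional


def first_healthy_zone(
    zones: Iterable[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first zone whose health probe passes, or None if all are down."""
    for zone in zones:
        try:
            if is_healthy(zone):
                return zone
        except Exception:
            # A probe that raises (timeout, connection refused) counts as unhealthy.
            continue
    return None


# Hypothetical zone names; a real probe would hit a health endpoint in each zone.
ZONES = ["az-1", "az-2", "az-3"]
down = {"az-1"}  # simulate the zone that lost cooling

print(first_healthy_zone(ZONES, lambda z: z not in down))
```

With only one zone in the list there is nothing to fall back to, which is the point: a single AZ takes you down with it.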
> The Big Red Cloud first advised customers of an outage at 2129 Sydney time (1229 UTC) on Wednesday, and 29 minutes later wrote to inform customers that the outage had started earlier than its first emailed advisory – at 1015 UTC.
Well, not for those who host their email in those data centres.
> Oracle’s second email delivered the mixed message that: “We are still investigating an issue in the Australia East (Sydney) region that is impacting multiple OCI services. We have identified root cause of service failures and are working to mitigate the issue.”
At least Azure Availability Zones work as documented (unlike Google's): all the impacted VMs in AU East that I'm responsible for failed over correctly to one of the other functional AZs.
It seems most of these outages are due to design and cost choices at the customer layer rather than a complete failure of the cloud substrate. If, as a customer, you're not considering a DC as a point of failure, then you will have an unplanned outage at some point during your operations.
"proactively powered down a small subset of selected compute and storage scale units, to avoid damage to hardware."
So the combined penalties that Azure might fork out for an outage on their cloud are less than the cost of a selection of compute and storage? Nah. I reckon they powered down that hardware because the data on that hardware might have become irretrievable. Hardware can be replaced. Data is a different liability altogether.
Sounds like we haven't heard the whole story. Mind you, after Azure token signing keys were exfiltrated earlier this year, we are getting used to not hearing the whole story from Microsoft.