Future
If we don't upgrade our grid and infrastructure, the future is clear:
- Our database cluster is down!!!
- Oh where?
- In the UK region.
- What time is it there?
- It's nearing 5.
- Damn, they put their kettles on!
Microsoft techies are trying to recover storage nodes for a "small" number of customers following a "power issue" on October 20 that triggered Azure service disruptions and ruined breakfast for those wanting to use hosted virtual machines or SQL DB. The degradation began at 0731 UTC on Friday when Microsoft spotted the …
Thing is, all the power companies are privatised now, meaning the shareholders get paid first, and improvements are only achieved by going to the government with the begging bowl in hand.
Which is much the same as before they were privatised, but now they're more "efficient". For the shareholders, obviously, not the customers.
I wonder if this explains the extreme lag and multi-second freezes on our Azure VMs on Friday, and the need to get one of them rebooted by our overworked hardware guys, as the remote desktop session was stuck saying "Please Wait".
If these go down completely, then at least I have an excuse to stop working, but intermittent slowdowns are just like water torture, where I have to slow my brain down to the speed of someone in sales and marketing, so that the connection has time to handle the mouse clicks.
We had a VM go down and Azure support were non-existent; a manager emailed after some time to say they had no one available to help. This went on for most of the day, and I gave up at 7pm on Friday night.
In the end I found a fix: the VM would not start because the load balancer's IP address was marked as already in use by the downed VM itself.
After taking some screenshots of the load balancer's configuration, I deleted and recreated it, and the VM finally started. Two minutes later I finally got a call from Azure support, so we compared notes and I closed the case.
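For anyone stuck in a similar IP conflict, a rough sketch of that workaround using the Azure Python management SDKs might look something like the below. This is only an illustration of the idea (the commenter presumably worked through the portal), and the subscription, resource group and resource names are placeholders:

```python
# Sketch only: capture a load balancer's config, delete and recreate it,
# then retry starting the VM whose IP it was holding. All names are placeholders.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "<subscription-id>"   # placeholder
RESOURCE_GROUP = "<resource-group>"     # placeholder
LB_NAME = "<load-balancer-name>"        # placeholder
VM_NAME = "<vm-name>"                   # placeholder

credential = DefaultAzureCredential()
network = NetworkManagementClient(credential, SUBSCRIPTION_ID)
compute = ComputeManagementClient(credential, SUBSCRIPTION_ID)

# 1. Capture the load balancer's current configuration (the "screenshots" step).
lb = network.load_balancers.get(RESOURCE_GROUP, LB_NAME)
with open("lb-backup.json", "w") as f:
    json.dump(lb.as_dict(), f, indent=2, default=str)

# 2. Delete the load balancer to release the frontend IP that the platform
#    still believes is in use by the downed VM.
network.load_balancers.begin_delete(RESOURCE_GROUP, LB_NAME).result()

# 3. Recreate it from the captured configuration.
network.load_balancers.begin_create_or_update(RESOURCE_GROUP, LB_NAME, lb).result()

# 4. Try starting the VM again now that the IP conflict is cleared.
compute.virtual_machines.begin_start(RESOURCE_GROUP, VM_NAME).result()
```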
"Due to an upstream utility disturbance, we moved to generator power for a section of one datacenter at approximately 0731 UTC. A subset of those generators supporting that section failed to take over as expected during the switch over from utility power, resulting in the impact."
Shouldn't these storage nodes be designed to self-recover in the event of an “upstream utility disturbance”?
To speculate: the diesel generators failed to kick in because they'd been contaminated with water. Water collects in the bottom of the fuel tank, and when its level rises to the level of the fuel take-off pipe, the water gets sucked into the diesel engine. I guess no one was tasked with regularly checking the tanks. Generally, uninterruptible power supplies (UPS) have about five minutes of battery power before the generators kick in. Maybe this should be extended to a good few hours or a whole day.
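Back of the envelope, the gap between a five-minute bridge and a whole day on batteries is enormous. Assuming, purely for illustration, a 2 MW datacentre section (an invented figure, not anything Microsoft has published):

```python
# Back-of-envelope battery sizing; the 2 MW section load is an assumed
# illustrative figure.
load_mw = 2.0  # assumed electrical load of one datacentre section

def battery_mwh(hours: float) -> float:
    """Energy needed to ride through an outage of the given length."""
    return load_mw * hours

five_minutes = battery_mwh(5 / 60)  # ~0.17 MWh: the usual bridge-to-generator UPS
one_day = battery_mwh(24)           # 48 MWh: riding out a full day on batteries

print(f"5 min bridge: {five_minutes:.2f} MWh")
print(f"24 h ride-through: {one_day:.0f} MWh "
      f"({one_day / five_minutes:.0f}x more battery)")
```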
"Decades of innovation, investment and better management mean that, overall, critical IT systems, networks and datacenters are far more reliable than they were."
If you knew how unreliable critical infrastructure really was, you wouldn't sleep at night. That's precisely down to lack of investment. You end up with something like this:
The passenger info boards went down in a major airport because a digger dug up the fibre cable to the nearest cloud provider. Why didn't they have a back-up cable, you may ask? They did, except the back-up cable went through the same pipe as the primary.
I think it's simpler. I bet the power control systems are based on Windows, and somewhere a modal popup window had appeared, as usual hidden BEHIND all the windows on the screen (because that's where it can do the most damage), which basically stopped the controls from working until someone found it and clicked. As it's Microsoft, it was probably over something useless like 'do you want to see more advertising?'.
Or the controls were not updated and it got stuck on Windows Vista's "You moved the mouse. Accept Y/N?"
The physical aspects are fixable. Microsoft's products are not.
"Maybe this should be extended to a good few hours or a whole day."
There was a story on El Reg sometime over the last week or three saying MS or AWS was looking at replacing diesel gennys with larger battery banks for their backup UPS. Not a bad idea on the face of it, but a battery backup has an extremely sharp cut-off point when the batteries run out, whereas a diesel genny can be refuelled indefinitely in the case of a prolonged outage.
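Rough numbers show why the refuelling point matters. Using typical figures assumed for illustration (the same invented 2 MW section load as above, ~10 kWh of thermal energy per litre of diesel, ~38% genset efficiency, ~0.2 kWh/kg for lithium-ion packs):

```python
# Rough comparison of sustaining a 2 MW section for one EXTRA hour of outage.
# All figures are assumed/typical, not taken from the article.
load_kw = 2_000              # assumed section load
diesel_kwh_per_litre = 10.0  # approximate thermal energy in a litre of diesel
genset_efficiency = 0.38     # typical diesel generator efficiency
liion_kwh_per_kg = 0.2       # typical lithium-ion pack energy density

extra_hour_kwh = load_kw * 1  # energy for one more hour of runtime

litres_of_diesel = extra_hour_kwh / (diesel_kwh_per_litre * genset_efficiency)
kg_of_batteries = extra_hour_kwh / liion_kwh_per_kg

print(f"One extra hour at {load_kw} kW needs either:")
print(f"  ~{litres_of_diesel:,.0f} L of diesel, which a tanker can deliver mid-outage, or")
print(f"  ~{kg_of_batteries / 1000:.0f} tonnes of extra battery installed up front")
```

In other words, a genny buys more runtime by the tanker-load; a battery bank only ever has what was built in on day one.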