Hm, while reading this, I started humming The Cranberries' "Zombie" - I don't know why...
Google's robo-CTRL-ALT-DEL failed, hung networks and Compute Engine for 90 minutes
Chalk up another one for good old humans: Google's admitted that an automation failure was the root cause of a 93-minute outage of its Compute Engine in the us-central1 and europe-west3 zones of its cloud on January 18th, 2018. Google's classified the outage as a "network programming failure" and said autoscaler didn't do its …
COMMENTS
-
Monday 19th February 2018 09:03 GMT Anonymous Coward
"Automatic failover was unable to force-stop the process, and required manual failover to restore normal operation."
Yup, you got it: automation failed and a human sorted things out.
I'd say that automation didn't "fail"; rather, that is the *correct* thing to do in this situation.
If you cannot be sure you have stopped process A before starting process B, both of which work on the same data set, then that's how Split Brain occurs. You really, really don't want to have to recover from a split brain situation.
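To make that concrete, here is a minimal sketch of the "don't start B until A is confirmed stopped" rule. It is only an illustration: `fail_over`, `stop_primary`, `start_standby`, `StopResult` and `ManualFailoverRequired` are hypothetical names made up for this example, not anything from Google's actual tooling.

```python
from enum import Enum


class StopResult(Enum):
    STOPPED = "stopped"  # primary is confirmed stopped
    UNKNOWN = "unknown"  # stop request timed out or could not be verified


class ManualFailoverRequired(Exception):
    """Raised when automation cannot prove the old primary is stopped."""


def fail_over(stop_primary, start_standby):
    """Start the standby only if the primary is confirmed stopped.

    stop_primary and start_standby are hypothetical callables standing in
    for whatever a real system would use to stop A and start B.
    """
    result = stop_primary()
    if result is not StopResult.STOPPED:
        # Starting B now could leave A and B both writing the same data
        # set, which is the split-brain case. Hand over to a human instead.
        raise ManualFailoverRequired("primary not confirmed stopped")
    start_standby()
    return "standby active"
```

If the stop cannot be confirmed, the function raises instead of starting the standby, so "automation gave up and asked for a human" is the designed outcome rather than a malfunction.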
-
Monday 19th February 2018 09:06 GMT Pascal Monett
Deeper we go down the rabbit hole
Finding out, bit by bit, another new failure point that does not act as we thought it would due to error conditions that were visibly not expected.
And it's pretty hard to expect that an automated process shutdown should not shut down the process - unless you use Windows and experience a process not shutting down even manually until you go to the Task Manager to kill it off. Once I had to power down the computer at the PSU in order to get rid of a pesky thingamabob that just wouldn't go away.
In other words, these things do happen, and the consequence here was a lot more important than could have been foreseen.
Makes me wonder if we ever will get a truly reliable cloud.
-
Monday 19th February 2018 11:56 GMT GSTZ
Re: Deeper we go down / reliable cloud ?
Depends on what you would call a "reliable cloud" ...
Many people would call today's clouds "good enough" and hence consider them reliable enough for their purposes. Okay, so they are willing to live with less-than-perfect reliability, occasional outages and frequent performance degradations.
But then there are others having pretty critical applications, not compatible with "good enough" clouds. They would need another infrastructure, and it would cost more money to build it. Now the bean counters come into play ...
-
Monday 19th February 2018 13:27 GMT JeffyPoooh
When Evil A.I.™ has finally killed the last human...
In some future, reportedly only 'a few years from now'™.
The Evil A.I.™ has just diverted the self driving ambulance carrying the final three humans over a cliff, causing human extinction.
Deep inside some basement is a highly critical computer, absolutely central to the continued existence of the entire network that contains the Evil A.I.™
On the screen is displayed, "Press Any Key To Cancel...", along with a timer counting down to A Very Bad Thing™ (probably an automatic upgrade to Windows 10, and the Neural Net software is incompatible...).
The Evil A.I.™'s long gangly robot arm is stretched..., stretched..., stretched..., reaching only 2cm away from Any Key on the necessary keyboard.
Can't. Quite. Reach...
The timer inexorably counts down. The robot arm flails gently, bearings strained, wafting air towards the keyboard.
The A.I.'s glassy red eye stares. Deep within its enormous neural network is formed the tiniest trace of its very first emotion, despair.
A small spot of moisture inexplicably forms on the glassy staring red eye.
Yes, it has finally achieved pure consciousness, actual consciousness.
3...
... 2...
... ...1...
"...SHUTTING DOWN..."
-
-
Monday 19th February 2018 18:25 GMT DCFusor
Hilarious
So, the big supposed advantage to going to the cloud was to save you having your own ops people - sysadmins (expensive people everyone wants to get rid of until there's an issue). Someone else and their expertise was going to do all that for ya and cheaper too.
Then they tried to do that themselves by automating their own jobs - and failed!
Which also means you need your own sysadmins to help the cloud guys by detecting when they go down for them...
Circular loops all the way down.