Recovery
Recovery procedures. Ha!
Almost no company has complete ones, and my guess is, nobody gets to test them.
I just suffered one of these situations last week... and guess what: no reliable backup, no good plan...
Amazon Web Services battened down the hatches and called in extra troops ahead of the vicious thunderstorms that ripped through the Ohio valley and across Virginia and Maryland on Friday night. But despite all the precautions – and there were many – the US East-1 region of Amazon's cloud was brought down after an electrical …
I remember a telco that used to do that. Every Monday the generators were started to make sure all was well. Then they were stopped. No-one considered that this was just like doing lots of short journeys in a car. When the big power cut arrived, the generators started perfectly. Then, after a few minutes, the hot and thoroughly coked-up cylinders started to misfire, and the engines packed up. It took a full cylinder head-off clean and rebuild to get them running again.
After that they were only tested once a month, and allowed to run for an hour or so to reach full temperature each time.
"The real story is not that a generator can bring down a chunk of a cloud, but that the recovery process when there is a failure is still hairy even after some of the best minds on the planet have tried to think of all the angles."
It's also the first time in human history that such cloudy juggling has been attempted at this scale. Just keep on trucking!
As for the 2012 leap-second Linux bug, I wonder how many Linux servers are still running full blast in the various datacenters right now. I hear our hosting company noticed a fat uptick in watts consumed when the bug hit.
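If memory serves, the workaround doing the rounds at the time was simply setting the system clock to itself (the old date -s "$(date)" trick), which was reported to clear the kernel's stuck timer state. Here's a minimal sketch of the same idea in Python, assuming a Linux box and root; purely illustrative, not a blessed fix:

    import ctypes, ctypes.util, time

    class Timeval(ctypes.Structure):
        _fields_ = [("tv_sec", ctypes.c_long), ("tv_usec", ctypes.c_long)]

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

    # Setting the clock to (roughly) its current value is all the reported
    # workaround did; requires root / CAP_SYS_TIME.
    now = Timeval(int(time.time()), 0)
    if libc.settimeofday(ctypes.byref(now), None) != 0:
        raise OSError(ctypes.get_errno(), "settimeofday failed (not root?)")
    print("clock re-set to", time.ctime(now.tv_sec))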
One of the problems with mega-big systems like this is that it's not economically feasible to test them at full size and load - you'd have to duplicate the mega-big system to do that.
The reason such full-size, full-load testing is necessary is to find bugs which only rear their ugly heads under high, complex loads.
I'm not criticizing Amazon (or Google, etc.) for this. It's simply the nature of mega-big systems.
Accordingly, when you're considering moving some or all of your company's processing and storage to a public cloud, you'd better figure in the cost of downtime to your company, along with a higher probability of outage than you'd ordinarily "think" would be likely.
Are there any studies showing that the public cloud goes down more often than private internal ones? Lots of internal networks are much less robust than Amazon's data centres.
When Amazon goes down for an hour, it's news; when some 500-person company goes down for 12 hours, news of it never leaves the building.
Totally agreed; good advice.
One addendum: agreed, you can't (economically) feasibly test with a production-size test system. However, you can (and should) test various failure scenarios in smaller environments, which might at least have brought up the software problems mentioned. (Of course, it might not, as these problems may only occur in more massive environments, etc.)
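To make that concrete, here's a deliberately small sketch of the sort of scenario test I mean: kill a child process while it's appending fixed-size records ("power loss" on the cheap), then exercise the recovery path on whatever it left behind. The journal format and recovery rule are made up for illustration; it won't catch the bugs that only show up at massive scale, but it's the kind of thing a small environment can run every night.

    import os, signal, struct, tempfile, time

    RECORD = 4 + 1000  # 4-byte sequence number + 1000-byte payload

    def writer(path):
        # Child: append records as fast as possible until it is killed.
        with open(path, "ab", buffering=0) as f:
            i = 0
            while True:
                f.write(struct.pack("<I", i) + b"x" * 1000)
                i += 1

    def recover(path):
        # Recovery path under test: drop any partial record left at the tail.
        complete = os.path.getsize(path) // RECORD
        with open(path, "r+b") as f:
            f.truncate(complete * RECORD)
        return complete

    if __name__ == "__main__":
        path = os.path.join(tempfile.mkdtemp(), "journal.bin")
        open(path, "wb").close()
        pid = os.fork()                    # Unix-only, deliberately low-tech
        if pid == 0:
            writer(path)                   # runs until killed
            os._exit(0)
        time.sleep(0.2)
        os.kill(pid, signal.SIGKILL)       # the "power cut"
        os.waitpid(pid, 0)
        print("recovered", recover(path), "complete records")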
Which is what many large industrial customers do anyway. Then you use the grid for the times your power plant dies. You can make power for about the same cost as the grid using 100% natural gas. Then you get 0.001% times 0.001% failure likelihood (rough arithmetic below). Actually not that good, because really big events might knock everything out, I guess.
The reliability of the North American grid will drop over the next decade as more destabilizing wind comes online, so it's wise to plan for that now.
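The arithmetic behind that, assuming (generously) that grid and plant failures are independent, plus a reminder of why that assumption is the weak spot. The 0.001% figures are the ones quoted above; the common-cause number is made up for illustration:

    # Unavailability if grid and on-site plant fail independently:
    p_grid = 0.001 / 100      # 0.001% as quoted above
    p_plant = 0.001 / 100
    print("independent outage probability: %.0e" % (p_grid * p_plant))   # ~1e-10

    # A single big event (storm, gas supply cut) takes out both at once, so a
    # common-cause term dominates; 1e-6 here is purely illustrative.
    p_common = 1e-6
    print("with common cause: %.0e" % (p_grid * p_plant + p_common))     # ~1e-06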
"Best minds"? Eh? Assumes facts not in evidence ...
Back when I worked for Bigger Blue (late 1970s), on Fabian Way in Palo Alto, we'd kill the mains power at 3PM on the last Friday of every month to ensure that the battery would carry the load long enough for the genset to warm up enough to take over. It ain't exactly rocket science. (In the event of failure, everyone went home early with two hours' pay. It never happened while I worked there.)
Keep in mind that this system ran the entire Ford Aerospace campus in Palo Alto, not just the computers. (OK, nearly the entire campus ... The fine folks in the Faraday Cage over on East Meadow Circle and across Adobe Creek on West Bayshore had their own solution ...).
"mains" or "genset" -> battery -> motor-generator -> building/campus power
My personal systems work the same way, and haven't been down since TehIntrawebTube's variation of "Flag Day" (January 1st, 1983).
I hear that natural gas fuel cells are getting good now, so they may make more sense than diesel or gas turbine generators ... unless the gas is electrically pumped and you have no on-site gas tanks (preferably underground).
... still with AWS -- nothing really compares with the maturity of the offering and pricing so far -- but we don't have any illusions... we've backed up all our critical files on the other side of the planet. We'd still be down if the worst happened, but at least we have a really remote set of backups. Google Compute will probably take another 2-3 years to be 'good enough' to even be considered.
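For what it's worth, our "other side of the planet" backup is nothing clever: just copying objects from a US bucket to one in a far-away region. A rough sketch of the idea with boto3 below; the bucket names and regions are placeholders, and a real setup would want something more robust (and incremental) than a bare copy loop:

    import boto3

    SRC_BUCKET = "my-critical-files"          # placeholder names
    DST_BUCKET = "my-critical-files-remote"
    src = boto3.client("s3", region_name="us-east-1")
    dst = boto3.client("s3", region_name="ap-southeast-1")

    # Walk the source bucket and copy every object to the remote region.
    for page in src.get_paginator("list_objects_v2").paginate(Bucket=SRC_BUCKET):
        for obj in page.get("Contents", []):
            dst.copy({"Bucket": SRC_BUCKET, "Key": obj["Key"]},
                     DST_BUCKET, obj["Key"], SourceClient=src)
            print("copied", obj["Key"])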
North America enjoys a pretty reliable power supply but when it goes down, it really makes a job of it.
I was in Toronto a few years ago when the Northeast US and southern Canada suffered a blackout that lasted up to 5 days. Generator sales went crazy; even the local butcher had two going to keep his refrigerators working.
Countries that experience failures on a regular basis are usually well prepared, with either portable or standby power units ready for service. High-rise apartment buildings in Ho Chi Minh City advertise features such as swimming pools and sun decks, and emergency power is always listed among them.
Many moons ago I worked as a technician at a Decca Navigator transmitter station, and our station manager, the late John Pratt, always surprised us with his sneak power failures, not only during the day but also in the middle of the sleeping watch at night. We had banks of batteries that carried the load whilst the generator started up.
When the real thing happened, our transmitter performed flawlessly.
Yep, seen the same thing. After a test, someone forgets to switch back to auto. Some time later a digger goes through some power cables and the entire area goes out. This is then very swiftly followed by a lack of power in the office... That's when you get to see how good your DR setup is!
"Impair" instead of "impaired" seems to be a typo, but the use of the present simple in a dependent clause to indicate and unchanging action seems correct to me.
In other words, "volumes that had in-flight writes when power was lost came back in an impaired state" indicates that the impairment happened just this one time and wouldn't normally be expected to occur, whereas "volumes that had in-flight writes when power was lost come back in an impaired state" implies that this is what is expected to happen every time.
I would agree, however, that having the whole sentence in present simple would have been equally correct and more aesthetically pleasing. At least to my ears.