That spinning flywheel...
...which provides power during generator startup is the same kind of system that Chernobyl was testing on the night of the accident, with similar results, it seems!
Amazon Web Services has explained the extended outage its Sydney services suffered last weekend, attributing the downtime to a combination of power problems and a “latent bug in our instance management software”. Sydney recorded over 150 mm of rain last weekend. On Sunday the 5th the city copped 93 mm alone, plus winds gusting …
That type of system is fairly common. The breakers used are usually the source of the problem. If they arc on opening and the arc doesn't break quickly, power can flow back onto the grid. Humidity is not a good thing for those breakers, and many times they are outside the main building or in the generator shed, where humidity isn't controlled.
OTOH, watching them arc is impressive and spectacular.
The article seems to gloss over this, but backfeeding power into the utility grid is going to make people very upset. And possibly dead. If your disconnect/transfer switch device is known to be problematic, why was it allowed to be installed in the first place?
It's usually "profit" or its cousin "economising on options".
There is a really good reason to ensure that the system you specified, the system that's been ordered (accounting types will go X=Y so Z (*)), and the system actually delivered (**) are all the same thing.
(*) We were lambasted over the price (and quantities) of tape being consumed in IT and asked why we couldn't use cheaper alternatives with all purchasing ability blocked until this was resolved. The answer was that we'd looked at the suggested products from Sellotape, but concluded that they would have an unfortunate tendency to gum up the tape drives.
(**) A classic case being the quantity surveyor who decided a building was massively overengineered, and so deleted much of this extra cost without referring back to the customer. The result was a purpose-built city library building that didn't have floors strong enough to hold bookshelves on its upper 3 (of 5) floors.
In our experience (filthy power is the norm in SE England) it's best to interpose the flywheel permanently between the source and the load, else normal day-to-day glitches and spikes will cause random trouble.
That way you gain the benefit of 100% conditioned power hitting the datacentre (no spikes, etc.) and you don't need to worry about breakers causing outages (although it's happened here on several occasions when the system has been switched to pass-through for flywheel maintenance; the Caterpillar/Standby Power Systems setup is pretty shitty overall, but still better than most of the rest).
This doesn't help if the diesels don't start - which has also happened here thanks to helpful people in the organisation economising on a £700k purchase by deleting a £500 redundant starting option.
Of course, if you're 100% serious about your power you run several flywheels and generators in an N+1 parallel configuration. This allows you to switch out one flywheel or one diesel for maintenance and still have the capability to ride out a power outage. Phase coherency is a long-solved problem.
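The N+1 arithmetic above is simple enough to sketch. A minimal illustration, with made-up unit ratings and a hypothetical function name (not anyone's actual kit or figures):

```python
# Illustrative N+1 sizing check. With N+1 units installed, you can take
# one unit out for maintenance (or lose one to a fault) and still carry
# the full load on the remaining N.

def survives_one_out(unit_kw: float, units_installed: int, load_kw: float) -> bool:
    """True if the site still carries the load with one unit offline."""
    return (units_installed - 1) * unit_kw >= load_kw

# Example: a 2 MW load served by 1 MW flywheel/generator sets.
# N = 2 sets carry the load, so N+1 means installing 3.
print(survives_one_out(1000, 3, 2000))  # True: the 2 remaining sets cover 2 MW
print(survives_one_out(1000, 2, 2000))  # False: with one set down, N alone can't
```

The same check generalises to N+2 and beyond by subtracting more units; the point is that redundancy only exists if the *remaining* capacity covers the load, not the installed total.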
No doubt the breakers were well tested. In Nevada. Humid? Sydney? Never. </sarcasm> Given the weather bureau was accurately predicting the size of that front and its probable arrival times and effects, why weren't the diesel gens up and running with their tanks topped up? Lots of stories about shiny generators without fuel when bean counters save money.
No-one in their right mind trusts the crap that is Oz power distribution, not to mention the compulsive planting of trees where they can do maximum damage in 30 years.
"Service unavailable" was your tag line for the few months you were in business. "I'm sorry, we can't find your order, our system's down again", was your unhappy customer service team's mantra to your dwindling and increasingly infuriated customer base. But at least you saved some money not buying your own machines... Oh wait. You were with Monster Cloud and your prices went up by 5,000% overnight! Each extra customer was actually costing you.
You still wonder where it all went wrong as you serve up yet another sugar loaded coffee perversion to a skinny Apple Watch wearing man-bunned hipster in Starbucks. The only thing in the cloud now is your head. How could the consultants have lied to you like this... How are you going to pay the bills?
"But I was in the cloud", you sigh. You weren't really. You just gave away the core of your business to someone else. And paid them handsomely to not care about it nearly as much as you did.
With bad weather forecast some time beforehand, would it have been hard for AWS to have one generator actually up and running in advance?
That would have helped avoid this outage and also provided a test for the UPS system.
If I had stuff on AWS, I'd be spitting chips over this, if the outage was indeed due to a UPS issue. But if I were on AWS, I'd also have systems ready in another availability zone to take over should one go down.
Their design philosophy is that servers are homogeneous and "cattle not pets." Cattle don't need dual power supplies and dual power feeds - they're expected to be frequently slaughtered and replaced. But the whole infrastructure is a set of least-cost dominoes built from lowest-price components. Cloud services are built cheap and recover easily from failures, but they're also built to fail - "cattle not pets." And sometimes it takes a while to get a new herd settled in.
Cloud service providers are like street vendor food - cheap, usually easy, and a lot better than nothing. But they aren't the most satisfying meal and sometimes you end up down and out the next day.
This outage is actually significant. It is now obvious that AWS has a design problem with its power systems. They know their data centers can't afford to go down from a power loss - bad for business and bad for customers, probably in that order. With all the money spent and time invested in getting the best of the best for their power systems, they still failed. They still failed. And it isn't the first time: AWS now has a trend of power failures, and it's the only major cloud with that trend. Experts can recommend solutions spanning multiple regions or cloud vendors to avoid application outage, and that's probably good advice; however, it significantly adds to the cost of the overall solution, is a bitch to test and manage, and the application may still go down for a wide variety of reasons that are impossible to test for across multiple regions or cloud vendors. Sticky wicket.
Let me try to decipher that marketing speak.
They were trying to find a Tier 3 datacenter that could accommodate their growth. There are plenty of them in Sydney. So what they were really trying to do was find a cheaper datacenter than Equinix. Failing to find one, they decided to build their own, which crumbled at the first power outage.
Seriously, how can AWS consider their Sydney site to be of acceptable standard when it's stated to be powered by a single utility provider? It sounds like they saved money on that but put in internal redundancies, gambling that the backups would work.
For any decent data centre I'd expect power inputs from multiple substations, and not uncommonly more than one utility provider. Costly, but that's how you avoid complete outages on a big scale.
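A back-of-envelope sketch of why multiple feeds matter, assuming feed failures are independent (a generous assumption, since a big storm front correlates them; the availability figures are illustrative, not any real utility's numbers):

```python
# Rough independence model: probability that at least one of k utility
# feeds is up, given each feed's standalone availability. Because storms
# and grid events correlate failures across feeds, treat the result as
# an optimistic upper bound on the benefit of redundancy.

def combined_availability(feed_availability: float, feeds: int) -> float:
    """Availability of the site supply given k independent feeds."""
    return 1.0 - (1.0 - feed_availability) ** feeds

single = combined_availability(0.999, 1)  # one feed:  three nines
dual = combined_availability(0.999, 2)    # two feeds: six nines, under independence
print(f"single feed: {single:.6f}, dual feed: {dual:.6f}")
```

Under independence a second 99.9% feed takes the supply from three nines to six; in practice the gain is smaller, which is why sites also layer on UPS flywheels and diesels rather than relying on grid redundancy alone.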
Biting the hand that feeds IT © 1998–2022