Telecity refused to comment when The Register phoned them to ask what had happened.
Oh! You managed to get through then?
Telecity has suffered a major outage at one of its London data centres this afternoon, which knocked out a whole host of VoIP firms' services, made Amazon wobble and borked its Direct Connect service. A source told The Register that the outage, which happened at around 2pm, knocked out four floors at Telecity's Sovereign House …
Looks like Amazon AWS Direct Connect customers are also affected:
http://status.aws.amazon.com/
6:47 AM PST We are investigating packet loss between the Direct Connect location at TelecityGroup, London Docklands, and the EU-WEST-1 Region.
7:36 AM PST We can confirm intermittent packet loss between the Direct Connect location at TelecityGroup, London Docklands, and the EU-WEST-1 Region. An external facility providing Direct Connect connectivity to the EU-WEST-1 Region has experienced power loss. We are working with the service provider to mitigate impact and restore power.
Now I can understand an undersea cable being a single point of failure with possibly very widespread implications, but if a power failure on a few floors of a single building can have such widespread consequences, don't we have a deeper problem?
Fortunately for us this only really impacted our test & dev environments and, ironically, our DR capability.
Don't forget the social impacts, which were more serious: I noticed that grumble feeds were intermittent and slow last night, and I'd therefore like to ask Telecity to investigate their backup power provision to prevent this happening again.
Just because you chose to put your production environment in a cloud doesn't mean you get resilience by default, right? Don't you still have to architect your cloud environment the way you would your own physical estate, picking resilient data centres (or regions) for your production workloads?
In Azure, for example, you might tick the geo-redundant option, or make sure your backups go to something that isn't in Azure for DR purposes. If a business jumps on the cloud train and works on the basis that it all "just works" when things go tits up, surely it needs to think again?
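For what it's worth, the "backups to something not in Azure" bit can be as simple as a scheduled job that copies the nightly backup out to a second provider. A rough sketch in Python, where the connection string, container, bucket and blob names are all made up rather than anyone's real setup:

```python
# Minimal sketch: copy a backup blob out of Azure Blob Storage to an S3 bucket
# at a different provider, so DR doesn't hang off a single cloud.
# Names and environment variables below are hypothetical.
import os

import boto3
from azure.storage.blob import BlobServiceClient


def replicate_backup(blob_name: str) -> None:
    # Pull the backup blob down from Azure...
    azure = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONN"])
    data = azure.get_blob_client(container="backups", blob=blob_name).download_blob().readall()

    # ...and push a copy somewhere that isn't Azure.
    s3 = boto3.client("s3")
    s3.put_object(Bucket="offsite-dr-backups", Key=blob_name, Body=data)


if __name__ == "__main__":
    replicate_backup("db-nightly.bak")
```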
Both UPS channels went offline in a cascade failure due to loading. Then the transfer to mains was disruptive and the transfer back to UPS failed.
And the second attempt to switch back has also failed and that involved switching it off and on again so it was a proper IT fix, not some bodge.
Currently it's running on utility power. It's not the first time the UPS systems at Sovereign House have gone out like this either...
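That "cascade failure due to loading" pattern is easy to show with back-of-the-envelope numbers: if two UPS channels are each carrying more than half of what a single channel can support on its own, losing one overloads the other. A toy model in Python, with invented ratings rather than anything from Sovereign House:

```python
# Toy model of a 2N UPS pair cascading under load. Numbers are made up
# for illustration; they are not Telecity's.
CHANNEL_RATING_KVA = 500          # what one UPS channel can carry on its own
total_load_kva = 800              # total IT load, normally split across both

channels = {"A": total_load_kva / 2, "B": total_load_kva / 2}  # 400 kVA each


def trip(name: str) -> None:
    """Channel drops out; its load lands on whatever is left."""
    load = channels.pop(name)
    for other in channels:
        channels[other] += load


print(f"Normal running: {channels}")     # both comfortably under 500 kVA
trip("A")                                # one channel faults or is taken offline
print(f"After losing A: {channels}")     # survivor now carrying the full 800 kVA

for name, load in list(channels.items()):
    if load > CHANNEL_RATING_KVA:
        print(f"Channel {name} overloaded ({load} kVA > {CHANNEL_RATING_KVA}), trips too")
        trip(name)

print(f"Surviving channels: {channels or 'none - straight to raw mains'}")
```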
This is one of those areas where you'd think they would be running tests on the system on a regular basis, with N+1 redundancy. If you don't test, you don't know.
But I've run into stuff like this before, where we had a dodgy transfer switch: if you let it sit for a month or two, one of the phases would stick and not flip over the next time you had an outage. But once you tested it... it would be happy as a clam and would switch back and forth no problem. It took monthly tests and metering the panel to finally find and prove the problem. Took over a year to find and solve this issue.
Ever since then... I test and check the voltages on the transfer switch.
Now in this case, if the loads are too high... then I suspect someone goofed and overloaded a single phase or something, so that too much power is being pulled from one leg at a time, which doesn't allow things to come up cleanly. Not a fun situation if you haven't planned for it and don't know how to shed load (i.e. turn off crap...) as you bring things up, letting disks spin up one by one instead of a huge thundering herd.
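The "shed load and bring things up one by one" bit is basically just sequencing against a power budget. A rough sketch, where spin_up() and the wattage figures are placeholders for whatever your kit actually draws, not a real recovery runbook:

```python
# Sketch of a staged power-up: bring disks/hosts back one at a time within an
# inrush budget, instead of letting everything spin up at once.
# spin_up() and the wattage numbers are hypothetical placeholders.
import time

INRUSH_BUDGET_WATTS = 2000        # extra draw we allow at any instant
SETTLE_SECONDS = 10               # time for a device to settle to idle draw

devices = [
    {"name": "disk-shelf-1", "inrush_watts": 900},
    {"name": "disk-shelf-2", "inrush_watts": 900},
    {"name": "db-host-1",    "inrush_watts": 1200},
]


def spin_up(device: dict) -> None:
    # Placeholder for the IPMI/PDU call that actually powers the device on.
    print(f"powering on {device['name']} ({device['inrush_watts']} W inrush)")


def staged_power_up(devices: list) -> None:
    in_flight = 0
    for device in devices:
        # If starting this device would blow the budget, wait for the earlier
        # ones to settle before continuing (i.e. shed the transient load).
        if in_flight + device["inrush_watts"] > INRUSH_BUDGET_WATTS:
            time.sleep(SETTLE_SECONDS)
            in_flight = 0
        spin_up(device)
        in_flight += device["inrush_watts"]


staged_power_up(devices)
```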