We all know....
... it was the iPhone.
Too many people accessing their TypePad accounts at once.
Come on, Reg - you've worked very hard lately to build a rep as troll-like iPhone haters - run the story - you know you want to!
Some of the net's biggest websites went dark following power interruptions in San Francisco that disrupted operations at co-location facility 365 Main. Websites for Craigslist, Typepad, Yelp, Livejournal and Redenvelope were all shuttered at the time of writing, and Sun's site was down earlier today. All these sites are …
OK, this one might not be their fault but this isn't the first time in recent months that SixApart have been unable to provide a decent service for their customers. They're still trying to clear up the mess from the LiveJournal Strikethrough 2007 (which was entirely their own doing) when this happens. Not the first time LJ has had major outage problems either.
Can't be good for the SixApart brand...
Also comes in useful for work stuff that's happening now, so thanks for posting this story.
Much of downtown SF lost power in the early afternoon. See the SF Chronicle article for details:
That does not absolve 365 Main for the failure of its backup generators to kick in. Improperly tested and possibly oversubscribed, mayhaps?
Having spent 26 years in IT, I know that the one thing that is absolutely certain is that no matter what disaster recovery plans you have, they will not work when a real disaster occurs.
After 9/11, we discovered that all the copies of the disaster recovery procedures were kept in the building next to the World Trade Center, the one we had just fled.
This has happened a couple of times to the hosts I use. It is always the backup generators, and I think this is becoming something of a joke. They are obviously not maintained properly and not adapted to cope with the increased number of servers. If I am paying good money for hosting which claims to have backup generators, then I can expect them to work during a power outage. Is that so difficult?
Actually, although it didn't take Second Life down completely, it left them with a lot of issues. Stipends were late, the grid was buggy (more so than usual, that is), and there were significant issues with logging in.
I suspect that the reason they weren't down completely is because they have several facilities that aren't located in San Francisco.
I work for a major ISP, and if there is an area-wide power outage it won't matter whether the company's UPS was working or not. Unless they have diverse connectivity, AND none of the street cabs were affected as well, there's not a lot they could have done if an entire district goes off. Fazal Majid appears to have overlooked this when he offers his critique of the situation; "Improperly tested and possibly oversubscribed, mayhaps?" reads more like "I'm improperly informed and under-experienced"... vinyl1 has his/her finger on the pulse; it's about time people were a bit more BIG PICTURE in their approach to resilient systems instead of just believing the hype and throwing blame around. Everything is based on over-hyping, under-providing, and under-buying...
Get your specs on, guys... that choccy lab is actually a Hungarian Vizsla - the smooth, not wire-coated, type. Wonderful dogs... very friendly and make great pets. Daft as a brush though, and like nothing better than to pretend they're about 5 times smaller than they really are and sit on your lap all night. All together now..... ahhhhhhhhh
@4:27: "one thing that is absolutely certain is that no matter what disaster recovery plans you have, they will not work when a real disaster occurs."
That's because too many people do their Business Continuity planning by starting with the Disaster Recovery, instead of the other way around. DR is only a part of BC, and unless you start by looking at the whole picture you'll never have adequate protection.
I have personal experience of an office shattered by a car bomb on a Friday afternoon. Everyone was back at a desk in a different building, with working phone and IT, by Monday afternoon. BC works, if you do it right.
To captain kangaroo :
Read the article mentioned by Fazal Majid
The power went down at 2pm, other sources mention 1.45, it was back by 4pm, 365 Main had about two hours with no power.
However, the sites dependent on 365 Main all failed shortly before 2pm.
Now, from a press release in March of this year :
To ensure uptime for all its tenants, each of 365 Main’s data centers features modern power and cooling infrastructure. The company’s San Francisco facility, for example, includes two complete back-up systems for electrical power to protect against a power loss. In the unlikely event of a cut to a primary power feed, the state-of-the-art electrical system instantly switches to live back-up generators to keep the data center continuously running and shield tenants from costly downtime.
In other words, they explicitly state that the power to them could be cut off and they will stay running. Which was a lie, because they failed at the same time as their power supply failed. Fazal's comment was correct, they didn't test their equipment properly.
I work in a company that employs 12,000 staff and runs thousands of servers in our data center. The UPS should cut in immediately, with no downtime at all. The backup generators should cut in after a few minutes, and then continue running the data center for a few hours. It is incompetence on the part of 365 Main that this did not happen.
According to some rumors the outage was caused by a drunken BOFH running amok at 365 Main. Not likely true but it would have been hilarious if it was.
No "disaster plan" is foolproof, however. There's always a bigger fool to break whatever plan you have nailed down. Remember that when your datacenter breaks for some bizarre reason.
(1) Have received a letter from your datacentre stating that they will be failing over the mains supply to test the backup generators and UPS ?
(2) Who would allow them to fail the mains supply if there was the slightest chance the test could fail and you would lose your website ?
(3) Worked in a datacentre that would regularly fail the incoming mains supply to test the UPS and generators ?
Come on, seriously ... these backup power supplies are never tested under full load because getting permission from all your customers will never happen. Imagine telling 20000 customers they might lose their website at a certain time if the failover to generators didn't work !!! I've known people to forget to refuel the generators after the last genny start-up, and once under load they only lasted 30 minutes before even they failed .....
Backup power requires a lot of thought and regular testing and problems only show up under full load (imho). People shouldn't expect 100% uptime and datacentres shouldn't tell people they can provide 100% uptime. People and equipment fail .... plan for it :)
I'll put my hand up here. The facility we use regularly tests the generators, including carrying the full building load. They also get load banked once a year to their full operating capacity. If you pick a decent facility, you've nothing to fear (above what you have on any other day of the week) with them testing the failover mechanisms and systems.
There's one simple fact here nobody's paying attention to: the sites went down "at the same time" as the power outage. Even simple UPS systems would have kept running for 10-20 minutes, giving them plenty of time to bring generators online even if they didn't automatically kick in. What happened here is that the power outage covered multiple blocks, so although 365 Main was still up and running, the first link their router connected to (some hub down the street from the building that their parent ISP or telco runs) didn't have power.
I experience this regularly, even on my fiber-at-home connection. In some cases, I still HAVE power, but a junction box 3 miles from here doesn't, so I have no net or phone if that happens. If the power outage is widespread (say from a hurricane) then everything is out: cellular, terrestrial phone, internet, everything.
Unless someone can prove the systems went down (easy enough to do via event viewer or logging systems on the servers and the routers) then 365 Main is not responsible for this outage, and should not have to pay 365 uptime guarantees. They'd have an easier time suing the power company for lost revenue in that case.
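For anyone wanting to check their own boxes, here is a rough sketch of the kind of log check meant above: look for holes in a series of log timestamps. The function name, the two-minute threshold, the placeholder date and the sample heartbeat entries are all made up for illustration; real logs would need parsing first.

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, max_gap=timedelta(minutes=2)):
    """Return (last_seen, next_seen) pairs where consecutive entries are more than max_gap apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]

# Hypothetical once-a-minute heartbeat entries with a hole in them.
# The date is a placeholder, not taken from any real log.
entries = [datetime(2000, 1, 1, 13, m) for m in (43, 44, 45, 58, 59)]
gaps = find_gaps(entries)

for last_seen, next_seen in gaps:
    print(f"no log entries between {last_seen:%H:%M} and {next_seen:%H:%M}")
```

A gap in the server's own log suggests the server itself stopped; an unbroken log alongside failed external monitoring points at the network path instead.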
"The power went down at 2pm, other sources mention 1.45, it was back by 4pm, 365 Main had about two hours with no power.
However, the sites dependent on 365 Main all failed shortly before 2pm."
Not sure of the point being made here... we know when they went down but thanks for the reminder...
Dare I say it but my company is bigger than yours, by about 50%. And we have 4 times as many data centres... but this isn't a cock contest.
Your misunderstanding smacks of a high level position in your company, conversely, comments like "Which was a lie"... smack of a low level understanding of the issue discussed... People don't lie, and it's not lying to transcribe from the installation manual, it's just not very savvy... but you knew that didn't you..
That statement is 100% correct, it's impossible to conduct a full load test; no 24x7 site is going to risk any downtime for a test.
Also, 365 Main didn't go dark. Not all the colo sites went offline; you are only hearing about a small fraction of users who experienced issues. There is a much larger percentage who did not have issues.
The generators did kick in, only not all of them did it perfectly. That said, more of the colo sites within the building stayed on than didn't.
The event that happened at 365 Main is a once-in-a-blue-moon event, and though things could have been better, they could also have been a lot worse.
A backup generator is a big diesel motor. It's got a starter battery. It's got oil in the crankcase. If you don't test-run the thing every now and then, the oil all drips out and the battery runs down. A lot of the time the generator lives outdoors where moisture and grit can get in. It still looks shiny though.
I worked for a company with backup generators and no plan to test run 'em. Sure enough, about once every 18 months we'd get a power failure and it would get as dark as the inside of a cow in that building. And not one time did the generator ever come on.
This is a lesson learned over and over by every company it seems. Don't the generators come with a DVD containing frisky corporate music and a plea to test-run the generators because otherwise they won't work when you need them?
I used to do this every Monday morning at 08:30 in a telephone exchange by tripping the mains circuit breaker. The engine, a straight-8 Lister diesel, always started and was left running the building right through to 13:00 or so. This covered the morning peak for telephone traffic.
A telephone exchange, as a piece of electronics, runs on a float power supply, with at least 2 rectifiers to charge a battery, which works as a UPS. Telephone exchanges do not fail very often because the power supply is integrated into the design of the exchange. The power supply represents a single point of failure; having integrated redundancy and a UPS ensures that problems do not happen. Should power really fail, customers with non-urgent classes of service will be prevented from originating calls to preserve power.
Dual integrated power supplies with in built fall back facilities, that work.
Now all the IT braggarts are exposed for what they are: incompetent buffoons who have confused knowing a bit about software and a few buzzwords with actually understanding what they are talking about.
Perhaps actually looking into a bit of electrical engineering and being ever so humble might well be the order of the day.
And the comment about disaster recovery, I worked for an FM company here and we tested out our DR plans every year. No one was allowed help from their colleagues, they had to follow the documentation and correct it if any errors were found. The trial simulated a real disaster and the guides were to enable anyone with a modicum of IT knowledge to recover the systems under test. Users were involved too, they were shipped in to test fully functional systems and sign everything off.
In fact, when relocating mainframes, I used to implement the DR process at the client site and use that for the system transfer. I moved 3 mainframes using this technique and every one went without fuss or problem. Why was this, because the DR system had been properly tested, was proven to work in annual trials and was relatively easy to implement.
I haven't named the company, as it may cause embarrassment to their competitors.
Maybe this is the rare occasion where big corporates can learn from Public/Government organisations. I was working for a UK.gov facility about 10 years ago, and they deliberately failed over to their backup generators once a fortnight for approx 20 minutes to ensure everything worked properly.
If the DC has a proper setup, I see no reason why they couldn't failover a small section to UPS/Generator backup once every month or two to check all is in order.........rotate through every section over the course of a day/week maybe
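As a rough illustration of that rotation idea, here is a sketch that assigns one section of the data centre to each test slot in turn. The section names and the weekly slots are invented for the example; nothing here reflects how any real facility schedules its tests.

```python
from itertools import cycle, islice

def rotation(sections, slots):
    """Pair each test slot with a section, cycling through the sections in order."""
    return list(zip(slots, islice(cycle(sections), len(slots))))

# Hypothetical section names and test slots.
sections = ["hall A", "hall B", "hall C"]
slots = [f"week {n}" for n in range(1, 7)]
schedule = rotation(sections, slots)

for slot, section in schedule:
    print(f"{slot}: fail over {section} to UPS/generator")
```

Only one section is on backup power in any slot, so a failed test affects a fraction of the customers rather than the whole building.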
There is no reason why a datacenter shouldn't be able to promise 100% uptime if they have the proper testing procedures in place. I used to work in a datacenter for a company that processed mutual funds transactions, and every Sunday AM we switched from mains to battery for an hour to exercise the backup system. At 1am the lights and air conditioning would shut off, something I never got used to, but the systems hummed along. Then the generators spun up and the datacenter came back to normal. An hour later we'd be back on mains. Not once did the systems fail.
... IFF you design the system right.
If you design the system so that you can start the genny and sync it to the mains then you can safely test it under full load. You simply start it, sync it, then turn up the regulator until your substation metering tells you that you aren't pulling power from the mains. There, running on the genny.
If you set things up to allow it, you go a bit higher and export a bit just to be sure - though I'm sure your local utility company will try and persuade you not to do that.
Of course, you still need to pluck up the courage to open the main switch at some point to test that you can still run as an island !
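The ramp-up described above can be sketched as a toy loop: raise generator output in steps and watch the metered mains import fall to zero (or slightly negative, if you export a bit). This is only arithmetic on made-up numbers. The function name, the 400 kW load and the 50 kW step are assumptions, and it models none of the real synchronising work (matching frequency, phase and voltage before closing onto the mains).

```python
def ramp_until_islanded(site_load_kw, step_kw=50.0, export_margin_kw=0.0):
    """Raise generator output step by step until mains import is at or below -export_margin_kw.

    Returns a list of (generator_kw, mains_import_kw) readings, one per step.
    """
    if step_kw <= 0:
        raise ValueError("step_kw must be positive")
    gen_kw = 0.0
    readings = []
    # Mains import is whatever the site draws that the generator does not supply.
    while site_load_kw - gen_kw > -export_margin_kw:
        gen_kw += step_kw
        readings.append((gen_kw, site_load_kw - gen_kw))
    return readings

# Hypothetical 400 kW site, no export.
readings = ramp_until_islanded(400.0)
# Same site, deliberately exporting 50 kW "just to be sure".
with_export = ramp_until_islanded(400.0, export_margin_kw=50.0)

for gen_kw, import_kw in readings:
    print(f"generator {gen_kw:6.1f} kW, mains import {import_kw:6.1f} kW")
```

Once the import reading reaches zero the site is carried by the generator while still tied to the mains, which is what makes the test safe: if the genny trips, the mains picks the load back up.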
365 Main does quite a bit of boasting on their website about their fancy electrical infrastructure, including a 35 kV feed and, in fact, no UPSes.
What they use is a device from a company called "Hitec" which is essentially a mechanical stored-energy device that has giant flywheels that are supposed to store energy long enough to switch over to the generator.
I have a client about a mile away from 365 Main and I got my first page about a power interruption at 1:46PM PDT. But perhaps what screwed-up these guys is that there was a whole slew of short interruptions of 1-2 minutes each. Subsequent pages show timestamps of 14:06, 14:08, 14:16 and 14:21.
Given that the flywheel will only keep things going for something like 5-10 seconds, you don't get much of a second chance if the diesel doesn't start up after the 2nd try. And with 5 power interruptions, that's potentially 5 times you get to test that in a row.
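To put a rough number on that, here is a back-of-envelope sketch. It assumes each interruption forces one independent start attempt per generator and that a start succeeds within the flywheel's ride-through with some fixed probability. The 99% figure is purely hypothetical, not a spec for any real generator, and independence is itself an assumption.

```python
def chance_of_a_failed_start(p_start_ok, interruptions):
    """Probability that at least one of several independent start attempts fails."""
    return 1 - p_start_ok ** interruptions

# Hypothetical: each start succeeds in time 99% of the time.
one_outage = chance_of_a_failed_start(0.99, 1)
five_outages = chance_of_a_failed_start(0.99, 5)

print(f"one interruption:   {one_outage:.1%} chance of a failed start")
print(f"five interruptions: {five_outages:.1%} chance of a failed start")
```

Under those assumptions, five interruptions in a row make a failed start roughly five times as likely as a single one, and that is per generator.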
365 Main boasting about their electrical infrastructure: