SourceForge
This week's edition of "Things you're surprised to find out still exist!"
A crippling data center power failure knackered SourceForge's equipment yesterday and earlier today, knocking the site offline. The code repository for free and open-source software projects crashed yesterday morning (around 0645 Pacific Time) after unspecified "issues" hit its hosting provider's power distribution unit, …
Do you live under a rock? A lot of people rely on Sourceforge, for good or bad. Portable Apps is but one example.
There are a lot of people who used to rely on SourceForge but have stopped using it because it's been banned by their current employer. I think the main reason a lot of people stopped using it may be that SourceForge at one point started purposefully bundling spyware and viruses with its executables in order to "make" extra money. Those of us who stopped using it aren't aware whether they've changed their deceptive practices.
"Those of us that stopped using it are not aware if they changed their deceptive practice."
They've changed ownership since then, and the new ownership said none of that would be going on anymore. So far, that seems to be true. I no longer have to avoid Filezilla, for example.
Hmmm - not for me. Got kicked out in the original outage. Now seeing a "404 File Not Found" on my browser tab and a "503 - Service Offline" message on the page when I try to log on again. Message follows:
"Slashdot is presently in offline mode. Only the front page and story pages linked from the front page are available in this mode. Please try again later."
> "we had already decided to fund a complete rebuild of hardware and infrastructure with a new provider"
They've completely missed the point.
Right answer: "we have decided to rebuild with TWO new providers, so if one data centre goes down, we just switch over services to the other one"
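In practice the switch-over itself is the easy bit to sketch and the hard bit to get right. Purely as a hypothetical illustration (made-up endpoint URLs, Python, and the actual repointing of DNS or a load balancer left as a stub), the logic is little more than a health check with a bit of hysteresis:

```python
#!/usr/bin/env python3
"""Minimal active/passive failover sketch, assuming two hypothetical
health-check endpoints. Repointing traffic (DNS, anycast, load balancer)
is deliberately left as a stub -- that part is provider-specific."""

import time
import urllib.error
import urllib.request

PRIMARY = "https://dc-a.example.org/healthz"    # hypothetical endpoint
SECONDARY = "https://dc-b.example.org/healthz"  # hypothetical endpoint


def healthy(url, timeout=5.0):
    """Return True if the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def promote_secondary():
    """Stub: in reality, repoint DNS or the load balancer to the other DC."""
    print("Primary data centre unhealthy -- switching services to secondary")


if __name__ == "__main__":
    consecutive_failures = 0
    while True:
        if healthy(PRIMARY):
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            # Only fail over if the secondary is actually up to take traffic.
            if consecutive_failures >= 3 and healthy(SECONDARY):
                promote_secondary()
                break
        time.sleep(30)
```

Requiring a few consecutive failures before promoting the secondary is just there to stop one dropped check from flapping traffic between data centres; real setups push this into the DNS/load-balancer layer rather than a script.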
Luckily SourceForge downloads were always mirrored. Still are, I believe.
But last time I looked I had to fight through a barrage of JS/Ad farm mirror redirector pages, all refusing to give me a direct link.
Having a mirror for people to access critical projects = good.
Having ad bloat, tracking, JS, redirecting nonsense in front of your mirrors, that goes down when your site is down = poor. Really really poor.
Anyone shopping for DC space should ask the proprietor when they last randomly flipped the master breakers with no advance notice[1] to test the auxiliary systems. This isn't because you expect an answer, it's for the amusement value of watching the Facilities guys turn grey in ten seconds.
Untested business continuity procedures are obviously likely to be worthless, but in fairness to the guys on the ground, actually running a test is likely to end your career. Identifying a critical weakness in the DR plan will not protect you from PHBs whose bonuses are linked to uptime metrics. This is why you hear of generators with no fuel, auxiliary power units that fail in seconds because the fuses have evaporated, and three-phase switchovers so wildly unbalanced that the upstream systems shut down.
[1] "Scheduling" tests never works because Operations will subvert you by shifting the workload elsewhere. That causes the servers, fans and CRAC units to idle which means the power load you're switching won't be representative of a real failure condition.
Re: doing an unscheduled test with no advance notice: eventually you *will* face this sort of test, whether it's your hand on the breaker, or the hand of Chaos. Given that this is the second time that SF has failed the test, it's hard to blame anyone other than SF themselves.
As you say, you don't do unscheduled random failure tests unless you want to be fired. What you do do is a hell of a lot of prep work to work out what should happen in a given scenario, and then plan a test of that at the least disruptive possible time for the business, with all hands on deck to clear up the mess when it goes to shit (it will - which is the whole point - so you can fix your plan).
In an ideal world (with bottomless pockets) you do this in a test environment that replicates Production, but few of us have that much money.
==
What surprises me in these news stories is the number of services for IT professionals that don't have a DR site. The story is about getting their primary infrastructure back up, rather than failing over to a secondary. But then you get what you pay for, I guess.
@Mark 110
I agree. The only place I've seen do proper tests on a regular basis was the military.
What you describe is the best that can be achieved in the commercial world, with the caveat that scheduling things at the least disruptive time for the business will often tend to invalidate tests, because the least disruptive time is usually also the time of minimal loading. The fact that you can switch the Amazon purchasing DC to the "B" feed at 2am on a random Tuesday does not mean you can do the same in the middle of Black Friday, and it is under maximal load that failure should be anticipated, because that's when everything is as hot as it's going to get and your mechanical components (e.g. CRAC units) are most likely to lock up and start a failure cascade. Faking full load with dummy processes (assuming Ops even have the capability) is only a partial solution because of thermal inertia.
As for DR sites, I think the main reason they are avoided is that even if Facilities hands over correctly, Ops won't. The network probably won't re-route properly, and even if it does you end up with dangling partial transactions in the storage and database systems, a nightmare job reintegrating the datasets afterwards, and inevitable data loss, because there is so much lazy writing, RAM buffering and non-ACID data (I'm looking at you, Riak) floating about in modern systems.
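On the earlier point about faking full load with dummy processes: the electrical half of that is trivial to script, it's the thermal half you can't fake. A rough sketch (Python, with a made-up test window) that just pins every CPU core for the duration of a scheduled test:

```python
#!/usr/bin/env python3
"""Crude dummy-load generator: busy-loop on every core for a fixed window
so a scheduled test draws something closer to production power. Purely a
sketch; BURN_SECONDS is an arbitrary, made-up test window."""

import multiprocessing as mp
import time

BURN_SECONDS = 600  # hypothetical ten-minute test window


def burn(seconds):
    """Spin one core with pointless arithmetic until the deadline passes."""
    deadline = time.monotonic() + seconds
    x = 1
    while time.monotonic() < deadline:
        x = (x * 1103515245 + 12345) % (2 ** 31)


if __name__ == "__main__":
    workers = [mp.Process(target=burn, args=(BURN_SECONDS,))
               for _ in range(mp.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

It does nothing for disk, network or CRAC behaviour, which is rather the point the parent post makes about thermal inertia.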
It's often the things you don't expect. I was in a data center that ran fine for three years, then one day we suddenly lost one of our two power feeds in our rack. (Naturally it was Christmas week and everyone was on vacation.) Turned out someone had forgotten to tighten a nut on a connection in the breaker box, way back when the center was built. Things were fine as long as the row of racks it fed was mostly empty...but when they finally got around to filling it, the extra current caused the high-resistance connection to melt down. Unfortunately this was the part of the power distribution system between the UPS and the servers, so the UPS's didn't help. I believe they started doing regular IR scans of the breaker boxes, after that.
I'm not sure anything will ever top the story I heard about a data center that did regular generator tests, always successful, but the generator failed after a few minutes when there was an actual power outage. Turns out no one had ever noticed that the fuel transfer pump was only wired to utility power...
And for years, apparently. I recall reading a similar "fuel pump not on the right side of the UPS" story in Lessons Learned (or not) from the (1965) Great Northeast (US) Blackout. There was also a similar story about a sump pump in one hospital being considered "not critical", at least until seepage from the nearby river rose to the level needed to short out the generator in the basement.
OTOH, there were rumors of a surge of births nine months later, although it's hard to imagine losing access to SourceForge and /. would have that effect.
"Scheduling" tests never works because Operations will subvert you by shifting the workload elsewhere. That causes the servers, fans and CRAC units to idle which means the power load you're switching won't be representative of a real failure condition.
I can well believe that.
Related, and seen somewhere on Youtube recently:
"Staff will treat Penetration Testers the same way as they do auditors. The natural inclination is to hide anything embarrassing; they won't tell them everything."
Facilities tests are almost always run by Facilities people who have a vested interest (and therefore a cognitive bias) in successful results. The Military case I mentioned before was more like a penetration test. The resiliency team *delighted* in failure - they weren't trying to prove the systems worked, they were trying to break them. That shift in perspective can dramatically change the results.
I remember one of our staff doing a pull-the-plug test they were specifically told not to do, and doing 10 grand's worth of damage to test systems. The test had been done before.
Trouble with disaster recovery tests is you don't want to test what would happen to your datacenter if someone took a baseball bat to the cage on the left by actually doing it.
[1] "Scheduling" tests never works because Operations will subvert you by shifting the workload elsewhere. That causes the servers, fans and CRAC units to idle which means the power load you're switching won't be representative of a real failure condition.
Very true, though scheduled tests can be useful for doing things like running the fuel store down on a regular basis so that when you pull the breakers on an unscheduled test (or an actual failure) your tanks aren't just full of sludge where the diesel used to be.
> ...scheduled tests can be useful for doing things like running the fuel store down on a regular basis so that when you pull the breakers on an unscheduled test (or an actual failure) your tanks aren't just full of sludge where the diesel used to be.
That actually happened to a hospital in the rural Michigan town I used to live in. It was later determined that the maintenance staff had been pencil-whipping the generator tests for years.
It can be worse than ending your career. If you're a systemically important financial institution then a failed DR test can quite plausibly crash the economy. So, not surprisingly, they never get done: they do DR tests, but they are very carefully rehearsed events, usually covering a tiny number of services, which don't represent reality at all.
The end result of all this is kind of terrifying: in due course some such institution *is* going to lose a whole DC, and will thus be forced to do an entirely unrehearsed DR of a very large number of services. That DR will almost certainly fail, and the zombie apocalypse follows.
Does it still make economic sense to buy space in a data center for a workload like SourceForge or Slashdot? I would have expected a cloud service like AWS or Azure to be a more (logical|scalable|inexpensive) choice.
(Full disclosure, I work for one of those cloud providers.)