
SourceForge
This week's edition of "Things you're surprised to find out still exist!"
A crippling data center power failure knackered SourceForge's equipment yesterday and earlier today, knocking the site offline. The code repository for free and open-source software projects crashed yesterday morning (around 0645 Pacific Time) after unspecified "issues" hit its hosting provider's power distribution unit, …
Do you live under a rock? A lot of people rely on SourceForge, for good or bad. Portable Apps is but one example.
There are a lot of people who used to rely on SourceForge but have stopped using it because it is banned by their current employer. I think the main reason a lot of people have stopped using it may be because SourceForge at one time started purposefully including spyware and viruses with its executables in order to "make" extra money. Those of us who stopped using it are not aware whether they have changed their deceptive practices.
"Those of us that stopped using it are not aware if they changed their deceptive practice."
They've changed ownership since then, and the new ownership said none of that would be going on anymore. So far, that seems to be true. I no longer have to avoid Filezilla, for example.
Hmmm - not for me. Got kicked out in the original outage. Now seeing a "404 File Not Found" on my browser tab and a "503 - Service Offline" message on the page when I try to log on again. Message follows:
"Slashdot is presently in offline mode. Only the front page and story pages linked from the front page are available in this mode. Please try again later."
> "we had already decided to fund a complete rebuild of hardware and infrastructure with a new provider"
They've completely missed the point.
Right answer: "we have decided to rebuild with TWO new providers, so if one data centre goes down, we just switch over services to the other one"
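For what it's worth, the "two providers" approach boils down to a health check plus an automated repoint. A minimal sketch in Python, assuming a placeholder health URL and a hypothetical update_dns() hook rather than any real provider's API:

import time
import urllib.request

# Illustrative placeholders only - not anything SourceForge actually runs.
PRIMARY_HEALTH_URL = "https://primary.example.com/health"
FAILURES_BEFORE_FAILOVER = 3
CHECK_INTERVAL_SECONDS = 30

def primary_is_healthy() -> bool:
    """Treat anything other than a clean HTTP 200 as a failed check."""
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

def update_dns(target: str) -> None:
    # Hypothetical hook: in practice this would call your DNS or load
    # balancer provider's API to repoint the service record.
    print(f"repointing service record to {target}")

def watch() -> None:
    """Poll the primary; after enough consecutive failures, fail over."""
    consecutive_failures = 0
    while True:
        if primary_is_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures == FAILURES_BEFORE_FAILOVER:
                update_dns("secondary data centre")
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    watch()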
Luckily SourceForge downloads were always mirrored. Still are, I believe.
But last time I looked I had to fight through a barrage of JS/Ad farm mirror redirector pages, all refusing to give me a direct link.
Having a mirror for people to access critical projects = good.
Having ad bloat, tracking, JS, redirecting nonsense in front of your mirrors, that goes down when your site is down = poor. Really really poor.
Anyone shopping for DC space should ask the proprietor when they last randomly flipped the master breakers with no advance notice[1] to test the auxiliary systems. This isn't because you expect an answer, it's for the amusement value of watching the Facilities guys turn grey in ten seconds.
Untested business continuity procedures are obviously likely to be worthless, but in fairness to the guys on the ground, actually running a test is likely to end your career. Identifying a critical weakness in the DR plan will not protect you from PHBs whose bonuses are linked to uptime metrics. This is why you hear of generators with no fuel, auxiliary power units that fail in seconds because the fuses have evaporated, and three-phase switchovers so wildly unbalanced that the upstream systems shut down.
[1] "Scheduling" tests never works because Operations will subvert you by shifting the workload elsewhere. That causes the servers, fans and CRAC units to idle which means the power load you're switching won't be representative of a real failure condition.
Re: doing an unscheduled test with no advance notice: eventually you *will* face this sort of test, whether it's your hand on the breaker, or the hand of Chaos. Given that this is the second time that SF has failed the test, it's hard to blame anyone other than SF themselves.
As you say, you don't do unscheduled random failure tests unless you want to be fired. What you do do is a hell of a lot of prep work to work out what should happen in a given scenario, and then plan a test of that at the least disruptive possible time for the business, with all hands on deck to clear up the mess when it goes to shit (it will - which is the whole point - so you can fix your plan).
In an ideal world (with bottomless pockets) you do this in a test environment that replicates Production, but few of us have that much money.
==
What surprises me in these news stories is the number of services for IT professionals that don't have a DR site. The story is about getting their primary infrastructure back up, rather than failing over to a secondary. But then you get what you pay for, I guess.
@Mark 110
I agree. The only place I've seen do proper tests on a regular basis was military.
What you describe is the best that can be achieved in the commercial world, with the caveat that scheduling things at the least disruptive time for the business will often tend to invalidate the test, because the least disruptive time is usually also the time of minimal loading. The fact you can switch the Amazon purchasing DC to "B" feed at 2am on a random Tuesday does not mean you can do the same in the middle of Black Friday, and it is under maximal load that failure should be anticipated, because that's when everything is as hot as it's going to get and your mechanical components (e.g. CRAC units) are most likely to lock up and start a failure cascade. Faking full load with dummy processes (assuming Ops even have the capability) is only a partial solution because of thermal inertia.
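To make "faking full load with dummy processes" concrete: a toy Python load generator that spins one busy-loop worker per CPU core for a fixed period. It exercises CPU and power draw only; as noted above, it cannot reproduce thermal inertia or realistic storage and network load, which is why it's only a partial answer.

import multiprocessing
import time

BURN_SECONDS = 600  # run each worker for ten minutes

def burn(seconds: float) -> None:
    """Busy-loop doing throwaway integer arithmetic until the deadline passes."""
    deadline = time.monotonic() + seconds
    x = 1
    while time.monotonic() < deadline:
        x = (x * 1103515245 + 12345) % 2**31  # arbitrary churn to keep the core hot

if __name__ == "__main__":
    workers = [
        multiprocessing.Process(target=burn, args=(BURN_SECONDS,))
        for _ in range(multiprocessing.cpu_count())
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()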
As for DR sites, I think the main reason they are avoided is that even if Facilities hands over correctly, Ops won't. The network probably won't re-route properly, and even if it does you end up with dangling partial transactions in the storage and database systems, a nightmare job reintegrating the datasets afterwards, and inevitable data loss because there is so much lazy writing, RAM buffering and non-ACID data (I'm looking at you, Riak) floating about in modern systems.
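The "nightmare job reintegrating the datasets" afterwards is essentially a reconciliation pass. A toy sketch, assuming (purely for illustration) that each record carries a version counter; real systems need write-ahead logs, vector clocks or manual review:

from typing import Dict, Tuple

Record = Tuple[int, str]  # (version, payload)

def reconcile(primary: Dict[str, Record], dr: Dict[str, Record]) -> Dict[str, str]:
    """Compare the two copies of a dataset and flag keys that diverged
    while writes were in flight during the failover."""
    verdicts = {}
    for key in primary.keys() | dr.keys():
        p, d = primary.get(key), dr.get(key)
        if p is None:
            verdicts[key] = "missing on primary"
        elif d is None:
            verdicts[key] = "missing on DR"
        elif p == d:
            verdicts[key] = "ok"
        elif p[0] > d[0]:
            verdicts[key] = "primary ahead (DR missed a write)"
        elif d[0] > p[0]:
            verdicts[key] = "DR ahead (primary missed a write)"
        else:
            verdicts[key] = "conflict"  # same version, different payload
    return verdicts

if __name__ == "__main__":
    primary = {"order-1": (3, "shipped"), "order-2": (1, "new")}
    dr = {"order-1": (2, "paid"), "order-3": (1, "new")}
    for key, verdict in sorted(reconcile(primary, dr).items()):
        print(key, "->", verdict)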
It's often the things you don't expect. I was in a data center that ran fine for three years, then one day we suddenly lost one of our two power feeds in our rack. (Naturally it was Christmas week and everyone was on vacation.) Turned out someone had forgotten to tighten a nut on a connection in the breaker box, way back when the center was built. Things were fine as long as the row of racks it fed was mostly empty...but when they finally got around to filling it, the extra current caused the high-resistance connection to melt down. Unfortunately this was the part of the power distribution system between the UPS and the servers, so the UPS's didn't help. I believe they started doing regular IR scans of the breaker boxes, after that.
I'm not sure anything will ever top the story I heard about a data center that did regular generator tests, always successful, but the generator failed after a few minutes when there was an actual power outage. Turns out no one had ever noticed that the fuel transfer pump was only wired to utility power...
And for years, apparently. I recall reading a similar "fuel pump not on the right side of the UPS" story in Lessons Learned (or not) from the (1965) Great Northeast (US) Blackout. There was a similar one about a sump pump in one hospital being considered "not critical", at least until seepage from the nearby river rose to the level needed to short out the generator in the basement.
OTOH, there were rumors of a surge of births nine months later, although it's hard to imagine losing access to SourceForge and /. would have that effect.
"Scheduling" tests never works because Operations will subvert you by shifting the workload elsewhere. That causes the servers, fans and CRAC units to idle which means the power load you're switching won't be representative of a real failure condition.
I can well believe that.
Related, and seen somewhere on Youtube recently:
"Staff will treat Penetration Testers the same way as they do auditors. The natural inclination is to hide anything embarrassing; they won't tell them everything."
Facilities tests are almost always run by Facilities people who have a vested interest (and therefore a cognitive bias) in successful results. The Military case I mentioned before was more like a penetration test. The resiliency team *delighted* in failure - they weren't trying to prove the systems worked, they were trying to break them. That shift in perspective can dramatically change the results.
I remember one of our staff doing a pull-the-plug test they were specifically told not to do, and doing 10 grand's worth of damage to the test systems. The test had been done before.
Trouble with disaster recovery tests is you don't want to test what would happen to your datacenter if someone took a baseball bat to the cage on the left by actually doing it.
[1] "Scheduling" tests never works because Operations will subvert you by shifting the workload elsewhere. That causes the servers, fans and CRAC units to idle which means the power load you're switching won't be representative of a real failure condition.
Very true, though scheduled tests can be useful for doing things like running the fuel store down on a regular basis so that when you pull the breakers on an unscheduled test (or an actual failure) your tanks aren't just full of sludge where the diesel used to be.
...scheduled tests can be useful for doing things like running the fuel store down on a regular basis so that when you pull the breakers on an unscheduled test (or an actual failure) your tanks aren't just full of sludge where the diesel used to be.
That actually happened to a hospital in the rural Michigan town I used to live in. It was later determined that the maintenance staff had been pencil-whipping the generator tests for years.
It can be worse than ending your career: if you're a systemically important financial institution then a failed DR test can quite plausibly crash the economy. So, not surprisingly, they never get done: they do DR tests, but these are very carefully rehearsed events, usually covering a tiny number of services, which don't represent reality at all.
The end result of all this is kind of terrifying: in due course some such institution *is* going to lose a whole DC, and will thus be forced to do an entirely unrehearsed DR of a very large number of services. That DR will almost certainly fail, and the zombie apocalypse follows.
Does it still make economic sense to buy space in a data center for a workload like SourceForge or Slashdot? I would have expected a cloud service like AWS or Azure to be a more (logical|scalable|inexpensive) choice.
(Full disclosure, I work for one of those cloud providers.)
The US datacenter construction boom may be faltering, and the reasons are not difficult to predict. The same supply shortages, price hikes and labor scarcity that have characterized not-quite-post-pandemic life are a risk for DC builders, too.
Construction consultancy firm Turton Bond's Darren Flood authored the report making that argument. Flood said that the need for datacenters is stronger than ever, but that "COVID-19 variants, changing restrictions, constrained supply chains and strong demand create an unpredictable market."
All of this is hitting after the datacenter real estate market exploded into its own boom times, with unprecedented investments in suitable building sites.
Nvidia exceeded market expectations and on Wednesday reported record first-quarter fiscal 2023 revenue of $8.29 billion, an increase of 46 percent from a year ago and eight percent from the previous quarter.
Nonetheless the GPU goliath's stock slipped by more than nine percent in after-hours trading amid remarks by CFO Colette Kress regarding the business's financial outlook, and plans to slow hiring and limit expenses. Nvidia stock subsequently recovered a little, and was trading down about seven percent at time of publication.
Kress said non-GAAP operating expenses in the three months to May 1 increased 35 percent from a year ago to $1.6 billion, and were "driven by employee growth, compensation-related costs and engineering development costs."
Residents in rural Culpeper County, Virginia, aren't letting Amazon build a datacenter without a fight, so they've sued the county to stop the project.
Culpeper County's Board of Supervisors voted 4-3 in early April to rezone 230 acres of a 243-acre equestrian center and working horse farm to light industrial use so that AWS could build [PDF] a pair of six-story buildings that cover 445,000 square feet on the site, along with an electrical substation.
Speaking to the Culpeper Star-Exponent, the six neighboring families that filed the suit largely argue that the datacenter would be an eyesore that would ruin the countryside. Unfortunately, eyesores aren't always legally actionable; zoning laws, however, are.
Comment AMD’s plans to integrate AI functionality from its Xilinx FPGAs with its Epyc server microprocessors present several tantalizing opportunities for systems builders and datacenter operators alike, Glenn O’Donnell, research director at Forrester, told The Register.
A former semiconductor engineer, O’Donnell leads Forrester’s datacenter and networking studies. He sees several benefits to the kind of tight integration at the die or package level promised by AMD’s future CPUs.
“The more you can put on the same die or on the same package, the better,” he said.
Amazon plans to build five more datacenters in rural Oregon at an estimated cost of $11.8 billion, according to documents filed in Morrow County last week.
The project would more than double the cloud colossus' datacenter footprint in the county. Amazon, home of AWS as well as an online shopping empire, operates four datacenters along the Columbia River, roughly 150 miles east of Portland, according to Oregon Live, which first reported on the planned expansion on Thursday.
If approved, construction of the five facilities would take place over the next four to five years, with the first facilities coming online in late 2023 and the last slated for early 2027.
Nokia’s "significant" contributions to Microsoft's open-source SONiC project and ongoing supply-chain challenges undoubtedly played a role in the Windows giant's decision to deploy the Finns' network switches, despite their relative inexperience in the arena, Dell'Oro analyst Sameh Boujelbene told The Register.
The deal, announced in mid-April, will see Microsoft use Nokia's 400Gbit/sec 7250 IXR appliances as spine switches, alongside the Finnish biz's fixed form factor equipment for top-of-rack (ToR) applications.
At the time, Nokia touted the deal as recognition of its ability to meet and exceed Redmond's evolving datacenter requirements.
Google is betting big on the future of the office with $9.5 billion in investments planned for such facilities in 2022.
Alphabet CEO Sundar Pichai said as much in a blog post, adding that Google plans to invest in new and existing datacenters as well as offices in the coming year. Pichai said that Google expects to create 12,000 new full-time Google jobs by the end of 2022. Over the past five years, Google said it has invested $37 billion in 26 US states, creating over 40,000 full-time jobs.
Google made the call to wind down pandemic-related remote work for its employees in March when it announced it wanted people back on site at least three days a week beginning in early April. Pichai said physical office investment may seem counterintuitive in the current environment, but he restated Google's position that remote work isn't the future.
The British division of global cloud and data centre services provider Sungard Availability Services was forced into administration amid a hike in energy bills and after failing to renegotiate landlord rental rates.
The UK wing called in Benjamin Dymant and Ian Wormleighton of Teneo Financial Advisory on 25 March, according to a letter sent to customers that was seen by The Register. The duo are understood to have secured interim funding to keep the business trading while a buyer is sought.
In the letter, joint administrator Dymant says Sungard had contacted customers in December to "recover increased electricity costs" related to colocation kit or other services in its UK facilities. Some customers reimbursed Sungard AS for this but others disputed the amounts.
Japanese giant NTT has found a friend to help it fund the planned expansion of its datacenter fleet.
That partner is Australia's Macquarie Asset Management – an outfit that already has $545.7 billion in assets in its tender care.
The new deal will see Macquarie spend around $800 million to take a majority position in the companies that own NTT's datacenters in North America and Europe. NTT will retain stakes of between 25 and 49 per cent in those entities.
Updated Intel hopes to gain an extra edge in the cloud and datacenter markets with the acquisition of Granulate, a developer of software that optimizes complex and older workloads for modern CPUs.
The semiconductor giant said today it has reached a deal to buy the startup, which has around 120 employees, for an undisclosed sum. Rumors of the deal emerged last week with a report from Israeli newspaper Haaretz, which said Intel would be handing over $650 million in the deal.
The Tel Aviv-based software company has raised a total of $45.6 million since its 2018 launch.