Lesson learned here ...
For those projects that I maintain an interest in, I regularly keep a master copy on my own systems. Notice the plural: systems. I've developed an avid interest in recovery over the last four decades.
When GitLab suffered its database deletion, outage and related failure of five backup tools, the company quickly offered The Register an interview. Which sounded like a good opportunity to learn just how a startup aiming for serious developers, and with US$25m of serious investors' cash in its keeping, could have failed to …
The lesson my company learned was to pony up the cash and build our own git servers. 4 modest 1U boxes deployed in two geographically separated datacenters (Amsterdam and Mexico City). Each box has a near continuous backup job to a massive SAN that is itself backed up to a tape library. The git servers replicate to a matching server in the other DC constantly.
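That constant replication can be sketched with git's own mirroring; a minimal sketch, assuming plain git, where the /tmp paths are throwaway stand-ins for the primary server and its twin in the other DC:

```shell
#!/bin/sh
# Sketch of git server replication. The /tmp paths stand in for the
# primary server and the matching server in the other datacenter.
set -eu
PRIMARY=/tmp/demo-primary.git
REPLICA=/tmp/demo-replica.git
rm -rf "$PRIMARY" "$REPLICA"

git init --bare -q "$PRIMARY"             # the primary server's repo

# One-time setup on the replica: a mirror clone copies every ref.
git clone --mirror -q "$PRIMARY" "$REPLICA"

# From then on, a frequent cron job on the replica keeps it in sync:
git -C "$REPLICA" remote update --prune
echo "replica in sync"
```

In the real setup the mirror would point at the other DC over ssh, with monitoring on the cron job's exit status rather than trusting it blindly.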
At home, I installed Git onto an old Pentium-4 machine I had rusting away in the closet; it backs itself up to an AWS micro instance with a glacial volume attached to it.
That way I actually own my code rather than it being the copyright of whoever owns the server, as most of the online source repositories try to make it. My code stays private, and if it blows up and I don't have a backup, well, that's my own problem.
"it backs itself up to an AWS micro instance with a glacial volume attached to it."
You what? AWS 101: attached volumes are EBS; Glacier is a storage class offered by S3.
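What the poster presumably wanted is an EBS volume on the instance plus an S3 bucket with a lifecycle rule that transitions old backups to the Glacier storage class. A sketch of such a rule (the prefix and bucket layout are invented for illustration):

```json
{
  "Rules": [
    {
      "ID": "archive-git-backups",
      "Filter": { "Prefix": "git-backups/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```

Applied with `aws s3api put-bucket-lifecycle-configuration --bucket <bucket> --lifecycle-configuration file://rules.json`.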
"That way I actually own my code rather than it being copyright of whomever owns the server like most of the online source repositories try to do"
Go on then... name a single one of the "most" source control repositories that claims copyright in the hosted source.
A failure to back up, and to regularly test that you can restore that backup, is simply a failure to take your clients seriously. Such failures make it far harder for a software architect to recommend spending the cash on an enterprise edition.
There's a typical culture here of knowing how to do things best without accepting that IT is a 50+ year old profession, with many professionals who know exactly how to implement and test a backup regime. Startups simply don't want to engage on the important things because they think they know it all. Imagine being a builder and not learning about architecture. Imagine being a chef and opening a restaurant with no training or skills.
There are hundreds of IT professionals out there who are laughing at the naivety of a company like this. Take a step back and get some advice from those with experience. It will cost less than you think and may ensure your business survives into the future.
"Startups simply don't want to engage on the important things because they think they know it all."
This fits nicely alongside a comment in the article:
"The defence also permits startups to take their eye off the ball a bit as they pour scarce resources into urgent priorities."
It's yet another lesson we can learn from "Yes Minister": important and urgent aren't the same thing. Stuff that's important has to be done. There's a lot of stuff that might be urgent, but it should only take priority if it's also important. If your job is looking after other people's data then making sure you have effective backups is very important indeed; it's a ball that you can't take your eye off.
I work as an IT Risk Auditor usually contracted by massive investment firms wanting a report on start-up companies before they risk millions on them.
I've found that a lot of start-up companies try to pinch every penny they can on the boring stuff like infrastructure so they can waste it on cultivating the kooky San Francisco Start-up image.
Had one company forgo purchasing tape drives and tape (as I suggested) and instead spent it on a new open-plan office just off Market Street. The tapes would've cost about $20,000 initially for drives and media, and around $500 a month afterwards for additional tape and off-site storage. The office, which only fit about 2 dozen people, cost them an additional $30,000 per month more than their previous, much larger, office in Reno (where everyone lived). I kept trying to reason with them that they needed something more than a couple of portable hard drives hooked up to their laptops (No actual backup software, just copying files manually), but they wouldn't hear it. I ended up recommending that the investors look elsewhere due to their blatant disregard for proper IT safeguards. But I should have suspected something when some representatives from the investment bank, my fellow auditors and I had a meeting with the company founder. He rolled in on one of those aluminium scooters 5 minutes late wearing a t-shirt with a suit printed on it when everyone else was wearing an actual suit and tie (myself included).
Another client blew all the money the Angel Investors gave them on holding an insanely expensive employee meeting in Monaco, flying everyone First Class and renting out a massive night club and conference space in the resort. That and buying everyone iPads and iPhones for personal use.
Each year, it becomes harder and harder to find a start-up that doesn't absolutely suck at IT. Those rare ones also happen to be the start-ups that become ridiculously successful, get bought out by a large company and continue to thrive while netting the founders hundreds of millions of dollars in profit. The reckless companies end up failing fairly quickly.
Jesus, I could literally picture the hack typing this article out with one hand and rubbing his nipple with the other. The sanctimoniousness of the article has just broken the sanctimonious chart.
I was affected by the outage, but the fact that GitLab were open about it impressed me. They could have done the typical start up bullshit of pumping out a load of buzzwords as to why the service was down. No, they said "yeah our admin deleted something".
Not at all ideal, but at the same time I wasn't a dickhead about it. Because I have been in the position before where I was in the wrong directory and hit the rm -rf command. A server that served 2,000 people went down for 36 hours while I worked with Rackspace to get the backups in place. And I had to deal with the fall out of specific bits of data not being part of the back up (they were added after the backup was made).
I would wager also that those, and I'm including you in this, Mr. Nipple-rubber, have been in similar situations. They have deleted stuff by mistake and realised afterwards that they didn't have a backup of the file or files. Yet they're probably the ones commenting on this saying "Tut tut tut this simply will not do. MUST DO BETTER! SEE ME AFTER CLASS". Not at all helpful, and not at all needed.
I'd also bet that if a poll were to be conducted asking how many of us have backups in place, "Yes" would be fairly high. Add a secondary poll asking the question: "Have you actually tested that the replication procedures work", the vote would be split fairly evenly. Because we work in an industry where we have time pressures, budgets, expectations, all set by people who don't know a thing about IT and yet demand the world.
As with life, when the shit hits the fan it's only then that changes are made. In business, you don't make a change to fix something that could happen. You fix it when it needs to be fixed. I don't like that, many people won't like that, but secretly everyone knows that's the score. You only need to look at airplane safety to see that in action.
But, while all this is going on, if you put your code on here to be kept safe then what does that say about you? They offer an offline version you can run on your own server, where you can be in charge of the backups etc. You can't put all your eggs in someone else's basket and then start crying when they drop it.
I once deleted part of the datafiles of an Oracle database. They respected the standards... until they ran out of space. They then used /var/log to store .dbf files...
So I stopped the DDBB, did a backup of the place where most of the .dbf and control files were located, performed a critical step that failed (there was corruption in the DDBB) and killed the DDBB.
I restored the DDBB with my backup, but it was out of sync with the files that were stored in /var/log... so, dead DDBB. Of course, the backups I was not responsible for also did not cover /var/log (the scheme was a cold DDBB backup, not a production DDBB backup).
So we all commit errors. (pun intended).
Anon as well, I like to get new jobs...
"Anon as well, I like to get new jobs..."
I've had two jobs since the story I told above; both times I got asked the question "What was your biggest mistake and how did you fix it?" or "What was your biggest challenge?". Both times I tell them that story. One job asked why I'd even say that in an interview, and I told them that I'm honest.
"What do you do?"
I would perform a full backup of the machine, confirm that the backup worked, then carefully construct the options I'm passing to rm (specifying the full path, running it with -i, prototyping it using ls to ensure I am deleting the correct files). Then I'll have someone else look at the command before I finally run it. Nothing is so important that it can't wait a few minutes to ensure it's being fixed properly.
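That routine, minus the interactive -i step (which would block a script), can be sketched on throwaway files; the directory and file names here are made up for the demo:

```shell
#!/bin/sh
# Demonstrate "prototype the glob with ls before handing it to rm".
set -eu
DIR=/tmp/rm-demo
rm -rf "$DIR" && mkdir -p "$DIR"
touch "$DIR/app.log.1" "$DIR/app.log.2" "$DIR/app.conf"

# 1. Prototype: let ls expand the exact glob rm will see, and eyeball it.
ls "$DIR"/app.log.*

# 2. Only then delete, with the full path spelled out. Interactively you
#    would add -i and have a colleague read the line before hitting enter.
rm "$DIR"/app.log.*

ls "$DIR"    # only app.conf should remain
```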
1/ lvextend, resize2fs
2/ vg is full
2a/ physical server: replace first RAID1 disk with a bigger one. Wait for sync. Replace 2nd drive. Create new array. pvcreate, vgextend, go to 1/
2b/ virtual server: virsh attach, pvcreate, vgextend, go to 1/. Or: shutdown, take snapshot, create bigger volume, swap disk in VM, boot, fdisk, pvresize, go to 1/
3/ fix the crap and/or provision more storage.
4/ by the way: check backups are still working
"They could have done the typical start up bullshit of pumping out a load of buzzwords as to why the service was down. No, they said "yeah our admin deleted something"
No, they couldn't. That's the thing most people totally ignore: they could not go that route because they weren't big enough. If they had gone that route and things had leaked, the backlash would have been devastating. They had no other choice but to be transparent.
"As with life when the shit hits the fan it's only then changes are made. In business, you don't make a change to fix something that could happen."
That's nonsense in my opinion. An outage can happen, but you don't change a backup strategy when you realize that it no longer works? You seem to forget where the money is coming from: it's from customers who rely on a company to handle those things for them which they don't (or can't!) think of.
And it's not as if others haven't gone here before. As a startup company there's plenty of material out there which can educate you in the dos and don'ts of business and IT in general.
And let's be honest here: not even bothering to actually look at a backup to see if it has done anything at all? Seriously? That is, in my opinion of course, way beyond a simple mistake which anyone would make. Sure, an amateur or newbie could do that. But not a company which gets paid by their customers to look out for them.
"But, while all this is going on, if you put your code on here to be kept safe then what does that say about you?"
Reverse logic much? Well, for starters: it shows that you have faith in a start-up company to at least respect their customers and ensure that they get what they pay you for. Following your logic I guess that's asking a bit too much?
"You can't put all your eggs in the someone elses basket then start crying when they drop it."
You can when you paid them a lot of money to do that for you and they assured you time and time again that they would not drop it. Here's a really important question: did customers get some kind of a refund for this horrible mess-up? I don't think so...
Say, just curious: you wouldn't happen to work for them, would you? ;)
I didn't read the article. I apologise to Mr Sharwood if the gist of it has been mischaracterised by the subeditor who wrote the headline, but given his previous history, I feel confident enough that this is not the case.
Like wolfetone above, I am also a GitLab customer. We did not suffer any loss of data, and we would *not* be resistant to loss of "last few hours" of data (issues, wikis). That is of course a quite deliberate decision based on cost-benefit analysis.
The subheader to this article seeks to simplify the problem beyond what is useful, credible, or reasonable, given the complexity of the operation at hand.
I would like to ask Mr Sharwood to provide his credentials as regards experience of designing, running, and maintaining complex business-critical systems. The same for those who join in the mob with vacuous criticism.
I have said it before, I do not come here to read a sensationalist blog. I come here in the hope that I will get news that is relevant to part of my operations and will be explained in an insightful manner. It's quite all right to be humorous or even sarcastic while doing that, as long as one comes out with useful technical insights that one can hopefully learn from. Saying "those idiots forgot to test their backups", as in the case at hand, is a useless oversimplification at best, which I suspect may be directed to an audience of mere dilettanti. I do not support that kind of tech "journalism".
Have you ever read the banner of this website? I think you may have missed one small part of it: "Biting the hand that feeds IT". It's why I actually read El Reg to be honest: they (usually) don't take the easy route, they don't go "awww, anyone can make a mistake so it's all ok" but request answers.
And most of all: when you make them a promise then they'll most likely hold you to it and will also be very open about the whole proceedings.
Let's be honest: only after El Reg made a bit of noise did GitLab suddenly wake up again. Did you ignore that part which said "stopped answering e-mails"? Does that really show you an open and transparent company which is ready to back up their words, or does it show you a company which only does what it did because they had to?
Forgetting backups, forgetting made promises, ignoring e-mails (like they ignored their backups I might add)... What's next?
Just for the record: I could have understood if they simply answered El Reg then ignored them. But promising an interview and then trying to stall things... that's simply showing too many parallels.
Hold the phone.
"Did you ignore that part which said "stopped answering e-mails"?"
There are more websites out there than El Reg. There are other people they need to speak to. We have no idea how many times the journo emailed GitLab, or whether the emails even went to spam. This happens. Furthermore, there is no mention of a time period. So those emails could've all been sent on Tuesday without response until Wednesday evening.
Like you I read The Reg for the same reason. But recently various articles have been sensationalist. And there is a difference between demanding answers and publicly belittling them. We, again, don't know how the story was put together, the time frames involved, or the tone of the emails. Put yourself in either person's place: if you get a dickish email you won't respond to it. At the same time, if you send an email and you're impatiently waiting for an answer you'll send a few more in the hope of annoying them enough to respond. We've all done that.
The emails were entirely civil, from both parties. Mine >>always<< are. FWIW I grew up writing formal letters and maintain that etiquette in email to this day.
The offer of an interview was made when the outage was fresh news. I twice chased up the offer, but the email trail went cold.
After more than a week, the offer of an interview was revived. I explained the line of questioning I intended to pursue before the interview and, as the story explains, said I wanted to talk to someone with operational responsibilities rather than marketing.
Anything else you'd like to know?
They had to. If they had tried to cover this up and it had eventually been revealed, they could have kissed their reputation (and most likely the entire company) goodbye. Simple as that. They didn't do this because they're such a great company, they did it for damage control.
If they were as great as they claim then they'd have gotten a techie to join the interview. In my opinion at least.
One way or the other, I think the whole incident shows us that you should never rely on a company to "keep your things safe" because there are no guarantees. But it also shows us that you're most likely much better off using GitHub than GitLab.
I mean, seriously, what the heck? They performed backups to Amazon's S3 buckets and as it turned out that bucket was empty. You make a backup, and you don't even bother to check if anything actually happened? Anything at all?
If you make such mistakes as a start-up company then I can only shudder at the negligence which is bound to manifest itself when the company grows. I see all the required potential for even bigger and worse scenarios, making Sony and their plain-text password storage drama look like a mere nuisance. That is of course assuming anyone is still willing to use their services, and quite frankly I sure don't. Pay a company for their services while knowing up front that they already screwed plenty of customers over with their excellent "backup strategy", vs. GitHub: not a company contract, all best efforts and such, but at least you'll know that the guys behind it will give it their best to ensure that things keep working.
And with "their best" I'm also referring to actually taking the effort to look into the state of your backups.
"You make a backup, and you don't even bother to check if anything actually happened? Anything at all?"
I also dislike (especially during the transition to live phase), the "we only send emails on errors" philosophy.
Until the system is mature, stable and tested, send success and failure emails. Emails are easy to filter. Emails are small. And you'd pretty quickly realise that something was wrong when you get *no* email at all.
Then, once you are happy that everything is working, you can go over to the 'only alert on failure' model.
And by tested, especially in this situation, I mean not just "do we have data in the backup bucket?" but "do we have data in the backup bucket, and can we restore it to a test system in a way that proves the backup and restore are working properly?". But then I'm an old and cynical support person that's been burnt once too many times when backups don't and restores won't...
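A toy version of that restore test, reporting success as loudly as failure; the paths are throwaway stand-ins and the notification is stubbed out as echo (in production it would be your mail alert):

```shell
#!/bin/sh
# Back up a directory, restore it elsewhere, and prove the two match.
set -eu
SRC=/tmp/bkdemo-src
RESTORE=/tmp/bkdemo-restore
ARCHIVE=/tmp/bkdemo.tar.gz

rm -rf "$SRC" "$RESTORE" "$ARCHIVE"
mkdir -p "$SRC" && echo "hello" > "$SRC/data.txt"

tar -czf "$ARCHIVE" -C "$SRC" .       # take the backup
mkdir -p "$RESTORE"
tar -xzf "$ARCHIVE" -C "$RESTORE"     # ...and actually restore it

# The step that got skipped: verify the restore against the source.
if diff -r "$SRC" "$RESTORE" >/dev/null; then
  echo "backup OK"        # in production: mail the success too
else
  echo "backup FAILED"    # ...and page someone
fi
```

Run frequently from cron, the "backup OK" line becomes exactly the kind of always-present success email described above, whose absence is itself an alarm.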
> Then, once you are happy that everything is working, you can go over to the 'only alert on failure' model.
I would suggest not even then.
If the regular "it worked" emails are too much, then put in a filter on the email client, so you still have the history. But one email every few hours is not much to deal with, and makes it clear that there are no reported issues.
My own personal systems generate three emails a day. It takes seconds to deal with them, and I know things have not failed. The habit is now strong enough that their non-presence triggers action.
> They had to.
Why speculate (wrongly) when you could just peruse the company's websites and come to an informed conclusion?
GitLab have been very open for as long as I've been dealing with them. In fact, thanks to their openness I have learned quite a bit of stuff and even borrowed from and been inspired by their sales qualification process.
Likewise, it is not the first time that something goes wrong with their systems, but regardless of whether it makes it to mass-media publications or not, they always own up and explain things frankly.
It does bother me a bit that people think they can read 500 words of unverified info on a blog and immediately jump to conclusions / pontificate like they're the ultimate authority on the subject.
If YOU are not backing up your own data, then any problems and consequences are your own fault.
This is the problem with the whole "it's in the cloud" bullshit people are coming out with left right and centre.
People think that because they host their projects on third party services, everything is just backed up automatically and there'll never be any problems.
What's to stop you, for example, downloading copies to a local hard drive? Oh yeah, you can't be bothered because you're paying for a "cloud service".
What happened in this incident, and the response from GitLab, is pretty awful. But people who think they themselves have no responsibility over securing their work and data get all they deserve.
Not defending lazy customers, but when you offer a paid-for service that includes backups of the data, you damn well better do some bloody backups. I could only forgive a company not backing up client data if they explicitly said that they aren't going to back up my stuff. Not backing up customer data is a stupid thing to do even from a business perspective since if the customers are responsible for backups and you lose their data, they are quite likely to just restore their backups to your competitor's service.
GitLab does not hire sys-admins, they only hire "DevOps" folks. These are the types of folks (and I've worked alongside many) that believe automation is the be-all and end-all in the infrastructure space. I am usually brought in to *manually* diagnose and resolve problems caused by mindless automation. In 97% of the cases, I find that the folks who write and manage the playbooks/cookbooks/recipes/whatever in fact never go back and actually log in to the systems at hand to verify the automation is doing what they told it to do (or they test on test/dev systems only, and assume the prod push will go fine). I've done this dance way too many times. I'm usually paid about 20-25% less than those kids as well (in case you cared). I always ask the same question: "why not log in to the systems (or issue non-interactive commands via ssh/script) to validate the automation?" The kids usually laugh and inform me that ssh'ing into boxes is sooooooo 1999. "It's all about cattle, not pets," they tell me.
Also, check out their devops (or other software engineering) job listings. This is a place that ONLY hires people with ruby experience. <<---- That should pretty much tell you everything you need to know.
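The "go back and look" step described above can be as dull as a loop of non-interactive checks. A sketch, where the remote transport is stubbed so it runs anywhere; the host list, the ssh options and the example check are all assumptions:

```shell
#!/bin/sh
# After an automation run, issue non-interactive commands against each box
# and compare reality with what the playbook claims. "local" is a stub so
# the sketch runs anywhere; real hosts would go through ssh.
run_on() {
  host=$1; shift
  if [ "$host" = "local" ]; then
    sh -c "$*"
  else
    ssh -o BatchMode=yes "$host" "$*"
  fi
}

for h in ${HOSTS:-local}; do
  # Stand-in check; in practice: is the service up, is the config file the
  # version you pushed, did the mount actually happen, and so on.
  if run_on "$h" 'test -d /etc'; then
    echo "$h: OK"
  else
    echo "$h: CHECK FAILED"
  fi
done
```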
I've been doing this for 25 years now. In various companies, 19 with the current one (despite being outsourced and returning to the fold).
I *have* affected availability. Several times. I take full responsibility for that when it happens. I also end up *fixing* it when it happens.
Anyone who has their hands on the hardware and access to <root/Admin/console> will have that moment. Usually several of those moments. It is one's ability to clearly state the issue, and the course of action to follow to resolve the issue that is absolutely critical. Does not matter what that course of action is so long as it puts the service back on line *safely*. One also has to *listen* to alternatives and provide solid reasons to either use those alternatives or NOT use those alternatives.
In my opinion GitLab blew it by not validating all the recovery methods they had in place and testing them. But they handled this particular situation as best as could be done under the circumstances. There is no magic that will make a backup work when it wasn't taken properly in the first place. If they've followed a course of action that will ensure that those backups work in the future, then they are doing the right thing.