
Great
It won't even connect to our GitLab.
So I guess we lost all our synced files.. GREAT.
I have a copy of everything locally, plus three backups. Tomorrow I will see if my work colleagues are as paranoid as I am.
Source-code hub GitLab.com is in meltdown after experiencing data loss as a result of what it has suddenly discovered are ineffectual backups. On Tuesday evening, Pacific Time, the startup issued a sobering series of tweets we've listed below. Behind the scenes, a tired sysadmin, working late at night in the Netherlands, had …
Actually, I've worked in places that did emergency testing. They did things like "let's kill all the connections through one datacenter and see if our customers can still access their stuff." Also "let's restore the backups to this test system and see that it works". As Gitlab has now discovered, backups without restore testing are not backups....
You're lucky to have the budget to do it. Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7, so no way to do a test. No test system to try the restore on (and besides, it differs from the live system, so things can still mess up in the real setting), and no way to really test for emergencies, because they depend on things that ONLY occur in real emergencies, such as power going out not just to the floor but to the whole building (and perhaps next door as well, just to be sure nothing was plugged into a jury-rigged supply).
"You're lucky to have the budget to do it. Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7, so no way to do a test."
Maybe you should rethink that..... Are you lucky enough to have the budget NOT to run a restore and risk losing everything? The cost of a properly tested restore system should be a vital part of any project budget. The lack of a test/restore system basically tells your customers that if things go wrong, you are bankrupt, and possibly they are as well (if they are external customers).
>Many times, people only have ONE live system (all they can afford) which MUST remain up 24/7, so no way to do a test.
Sounds to me like the failure is in the business model of the company. Those generally are the type of companies that are one recession or self created disaster away from administration.
"Sounds to me like the failure is in the business model of the company. Those generally are the type of companies that are one recession or self created disaster away from administration."
That's why it's called living on the razor's edge. Where margins are close to zero all the time. You'd be surprised how many firms HAVE to live like this because they flip between profit and loss every month. You're floating in the ocean and you barely have the stamina to tread water. Sometimes, that's all you're dealt. All you can do is hope for shore or some flotsam.
>Sometimes, that's all you're dealt. All you can do is hope for shore or some flotsam.
That's fine if you are a brave entrepreneur with few limits on what you can reap if you succeed, but taking a job at one of those companies is another matter (especially without a big ownership stake). A big part of job interviewing from the view of the interviewee is figuring out if the company is one of those companies. If you do take the job then it probably means you need to do a better job researching companies or you need to increase your skills and experience so you don't have to work for those type of companies for long if at all.
I forgot to expand on the whole angle of hoping to cash in on the ground floor of a startup, which again is fine, I suppose, if you are young or aren't the only income earner in your family, but it still probably won't end up being one of your wiser choices. If you are lucky you might get to keep the actual pets.com puppet after everything goes sideways.
"A big part of job interviewing from the view of the interviewee is figuring out if the company is one of those companies. If you do take the job then it probably means you need to do a better job researching companies or you need to increase your skills and experience so you don't have to work for those type of companies for long if at all."
Or it simply means you're out of options. If they're the ONLY opening, then as they say, "Any port in a storm."
>Or it simply means you're out of options. If they're the ONLY opening, then as they say, "Any port in a storm."
Which is fine unless you spend decades in that situation and then turn around and blame globalization for all your problems. Not you per se, of course, but a significant number of people.
If your system is set up such that your disaster remediation cannot be tested, it is set up WRONG. There, I said it.
I've shared my personal example before. At a previous job, my so-called supervisor, messing with SQL queries, did a Delete thinking it would clear his query. Blew away the whole database with a single click. They never let me test the backups, claiming "24/7, can't be down for testing!". Instead we were down for THREE DAYS while an SQL consultant helped rebuild from scratch, and we never recovered all the data.
Disaster testing always creates some inconvenience, but that's no excuse to skip it. A smart captain never complains about the lifeboat drills.
Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
A company running a production service with this backup strategy in place should be deleted.
All those rushing in to criticise, how confident are you in your processes?
Before the event these guys were very confident, they had LVM snapshots, DB backups, Azure snapshots and even a copy on s3 in case Azure goes down. I bet they were even tested in the first place.
Even after all this very public embarrassment they do still have a copy from 6 hours earlier.
How many of you could suffer multiple failures and still be able to do a same day recovery?
Having read your comment, you have restored my faith in people. On stories about data loss where the backup failed, I usually see comments along the lines of "one backup is none and two is one", etc., so these guys had four by that rule and still got stung. I feel sorry for whoever nuked the data and for the people responsible for making the various backup scripts. I know if it was me I would be feeling bad and wondering what I could have done differently, and in future I'd probably become even more paranoid about all the systems I have in place.
I wish these guys well and hope they don't get too much stick, just the opportunity to fix it.
Mmm.
There is an old tale called the Tao of Backup, written in the days when NT4 was the shiny new thing people aspired to and the press was talking about the upcoming Win98. It's still available online.
http://www.taobackup.com/
Skipping the first four points we get to :-
5. Testing
The novice asked the backup master: "Master, now that my backups have good coverage, are taken frequently, are archived, and are distributed to the four corners of the earth, I have supreme confidence in them. Have I achieved enlightenment? Surely now I comprehend the Tao Of Backup?"
The master paused for one minute, then suddenly produced an axe and smashed the novice's disk drive to pieces. Calmly he said: "To believe in one's backups is one thing. To have to use them is another."
The novice looked very worried.
It then links to this page, which lists some of the things that can go wrong through no fault of your own.
http://www.taobackup.com/testing_info.html
Gitlab didn't have backups. They had backup scripts that didn't work. Having those sort of problems happens. Not testing your backups and discovering those problems before an emergency hits is inexcusable.
"Gitlab didn't have backups. They had backup scripts that didn't work. Having those sort of problems happens. Not testing your backups and discovering those problems before an emergency hits is inexcusable."
^ I fully agree with this, and the question then becomes, "Will Gitlab learn from this experience?". I hope that they do, and that they now set up and test suitable backup systems.
Examples of how not to do things include TalkTalk with multiple breaches where they do nothing, bad things happen again and then customers leave.
"Not testing your backups and discovering those problems before an emergency hits is inexcusable"
How do you know they didn't test? I guess they did test, but the Postgres version changed and the database backup stopped working.
So they didn't test regularly enough. What is regularly enough? And how many people actually do that, rather than take a holier-than-thou attitude on a forum.
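If a client/server version mismatch was the failure mode, a pre-flight check would have caught it before the dump ran. Purely as an illustration (the database name, backup path and version handling here are made up), something along these lines refuses to back up with a mismatched pg_dump:

```
#!/bin/sh
# Hypothetical pre-backup sanity check: refuse to run pg_dump if its major
# version doesn't match the server's, since a mismatch can make the dump
# fail or quietly produce something unusable.
SERVER_VER=$(psql -At -d mydb -c "SHOW server_version;" | cut -d. -f1,2)
DUMP_VER=$(pg_dump --version | grep -oE '[0-9]+\.[0-9]+' | head -1)

if [ "$SERVER_VER" != "$DUMP_VER" ]; then
    echo "pg_dump $DUMP_VER does not match server $SERVER_VER - aborting" >&2
    exit 1
fi
pg_dump -Fc mydb > /backups/mydb_$(date +%F).dump
```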
Completely agree with both of you. There's no doubt they made mistakes, backups, replication etc should have been tested properly and clearly wasn't, but do you know what really hit me reading about this? How open they've been about the whole thing. From the initial issues through to publishing the live notes for the restoration attempts, I'm personally very impressed that they've been open and honest about what's happening. So many other companies would and have hidden behind vague "we're working on it" responses, and while it doesn't undo their failure to test, I think their honesty does need to be acknowledged and commended.
They could start by delegating someone to be responsible for checking the results of their backup plan, on a daily basis, to ensure that backups are actually being made and are valid.
As a bare minimum they should have scripted this to fully test the output (Is the target accessible, does it contain a backup, is the backup valid?) then email a warning in the event of failure. Or better yet, use the dead man's switch method by emailing the successful results of a fully verified backup, then have a warning issued on the admin's local system in the event that no such message is received, to account for not only backup failures but communication failures too, and beyond that have the admin manually check for such messages on a set schedule, in case the local warning system (cron et al) doesn't work for any reason.
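For illustration, the minimal verify-and-report step described above might look something like this (the paths, size threshold and mail address are made up):

```
#!/bin/sh
# Hypothetical nightly backup check along the lines described above.
BACKUP=/backups/gitlab_$(date +%F).dump     # placeholder path and naming scheme
MAILTO=ops@example.com                      # placeholder address
MIN_SIZE=1048576                            # anything under ~1 MB is clearly not a real dump

fail() { echo "$1" | mail -s "BACKUP FAILED" "$MAILTO"; exit 1; }

# Is the target accessible and does it contain today's backup?
[ -r "$BACKUP" ] || fail "backup missing: $BACKUP"

# Is it a plausible size? (GitLab's were reportedly "only a few bytes in size".)
SIZE=$(stat -c%s "$BACKUP")                 # GNU stat
[ "$SIZE" -ge "$MIN_SIZE" ] || fail "backup suspiciously small: $SIZE bytes"

# Is it structurally valid? pg_restore -l just lists a custom-format archive.
pg_restore -l "$BACKUP" > /dev/null || fail "backup unreadable by pg_restore"

# Dead man's switch: mail the success; raise an alarm elsewhere if this
# message ever fails to arrive.
echo "backup OK: $BACKUP ($SIZE bytes)" | mail -s "backup verified" "$MAILTO"
```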
Beyond that minimum effort, they should also have had full server replication on a hot spare, not only for failover but also for bare metal restore testing, which is ultimately the only way you can be sure your backup process really works, as opposed to "seems to work", "completed without error" and "says it verified", all of which amount to exactly nothing. Until you have successfully performed an actual restore, you simply don't have a backup.
The most shocking part of this is not that backups failed, but that nobody noticed, at least not until it was already too late.
Essentially they had no process in place, they had a token effort that was cobbled together then not even verified as working. That's the unforgivable bit.
It is all relative.
For example, in our small team we destroy our entire staging environment and perform a complete restore from backups once every month or two. We also restore the production backups into a separate environment to verify that they are working. I deliberately get someone who hasn't done it recently to check that the documentation is correct and everything works. Given that we also have the git repos and source on multiple dev machines, the worst-case data loss is a dev not committing regular changes. But everything is tested, and I believe regularly enough for a team of 10.
If I was hosting thousands of people's data, I'd expect more regular testing of the backup and recovery procedure. They can never have tested it with production data, as they have no working backups. That is just incompetent.
> All those rushing in to criticise, how confident are you in your processes?
Very confident in our backup system. Quite frankly because I insist on regular restore testing from backups, precisely for this reason. A habit I got into many years ago when I worked for a big company where we stored very sensitive personal data.
Until you have done a successful restore, you don't have a verified working backup system. This is simple 101 of sysadmin work. It isn't something exotic or really hard to wrap your head around.
There was the joke that you could back everything up to /dev/null just fine. However the restores would be much harder. Sounds like the gitlab guys inadvertently used a similar concept as their backup strategy.
I can understand the accidental rm -rf; mistakes happen, we are human after all. However, backing up (presumably for years) without ever even bothering to check the size of your backups is just gross negligence. Surely a quick look at the backups, noticing all the files are suspiciously 512 bytes in size, would have been enough to draw attention that maybe something wasn't working in the backup system.
If what they say is true and they essentially have no backup, then this was a completely avoidable situation that was of their own doing.
All those rushing in to criticise, how confident are you in your processes?
You're right, although I'm old school - where I come from, backups aren't as important as restores, so a backup wasn't a backup until you could prove you could actually use it to recover.
That said, the first ever from-the-metal-upwards restore test I did was the single most nerve-wracking thing I've done in my career, despite the fact that I had two separate backups and a hot standby site to take over if recovery took more time than the available test window.
That said, you haven't earned your sysadmin badge if you haven't learned the "rm -rf" lesson the hard way...
"That said, the first ever from-the-metal-upwards restore test I did was the single most nerve-wracking thing I've done in my career"
I've worked in two places where we had DR contracts including rights to run practice restores. They can be learning experiences, especially the first one. /etc was the last directory on the last dump tape. We had to sit twiddling thumbs waiting to get a system we could log into & then ran out of time before we had a restored system. It ensured the dumps were better organised for the next pass.
"All those rushing in the criticise, how confident are you in you processes?"
Confident enough to know it works through testing.
I could defend an honest mistake, not a short-sighted lazy one. It could be argued that they were not lazy (who knows, maybe their testing matched their design), but they were clearly short-sighted. Either they were or they were not designing to be bulletproof - which was it?
All they had to do was test for the worst. This isn't an after-school special; not everybody is a winner, not everybody deserves a reward. At least they owned it. They could have said "You're holding it wrong."
On my second day at a job, I deleted the entire stack of the test system with a misplaced rm -rf.
I crapped myself thinking I'd be instantly fired. My boss made some 'angry' sounds, then told me it wasn't the biggest issue, as they needed to try a fresh install of the new version anyway (that's how the new version would be rolled out in production, rather than upgrading, which is what they normally did on the test servers).
This also allowed them to fully test the backups, pulling the older data from the production backups, anonymising it as required, and also finding some faults with various processes that were included but didn't work after the upgrade. In all, the test system was down for about 4 days instead of 1, but fixing the systems so the upgrade could get the go-ahead in production took a month or more. If I hadn't 'slipped up' then they wouldn't have known about these issues until trying to go live in production, and if so, it would have been a very long night of around 6-8 hours reinstalling the older version back into production (after the 6-8 hours of installing and testing the new version).
This attitude of 'we can't afford to test it' is utter bollocks. You fire up as many VMs as required in the cloud, and you at least verify the _data_ is there, even if the functionality isn't. It's bad to find the code for the production system isn't backed up as much as you think it is; it's unrecoverable to find out the data is gone.
These guys got lucky, if he hadn't taken that copy 6 hours before they'd be dead in the water and the company would be gone.
We test our backup processes quarterly. Real data is deleted and restored from backup, the files opened to check integrity. Failover servers are shut down and the handoff to partner server checked.
Can we guarantee nothing will go wrong? Of course not, such a thing is not possible. Sometimes when a ship sinks it lists too far and half the lifeboats cannot be used. That's no excuse to skip lifeboat drills, nor is "the crew can't stop what they're doing to run safety drills."
So with apologies to Adam and his admirers, the Gitlab geeks did not do their jobs properly. They admit that the backup files were too small to be believable. That alone should flag their systems as not working, with no shutdown testing required. They missed several chances to spot this.
"We test our backup processes quarterly"
...and you're confident? Then you are far too complacent. You could easily lose 3 months of data if something's gone wrong. Think about all the things that could have failed between your last test and now - the tapes, the tape drives, the disks, the software, a permissions issue, a version mismatch, a CRC algorithm change, new directories, new servers, credentials, sneaky ransomware, ... the list goes on.
I test weekly. And I still think my backup process is worse than the one these guys had.
When I was teaching at a college, the account containing all of the homework submissions was deleted by a rogue script. Next day in class I used this as an example of the necessity of having good backups. I saw many smirks around the room until I pointed out that we took backups seriously (weekly fulls plus incrementals) and that the account was fully restored, losing about 4 hours of work that the students could easily resubmit (unless they had deleted their submissions, which is unlikely). We probably could have restored more, but didn't want to restore anything from after the rogue script started deleting files.
On another note, we had an IT professional from Paychex (a company that prints checks for small businesses) give a talk on something or other and after the talk I asked how the company survived a state-wide power failure lasting for several days. His answer was that they keep 6 copies of their essential data spinning and online. During the power failure all UPS systems worked perfectly and they lost nothing. However their customers could not download their information because they did not have power so Paychex had to rent trucks to cart all of the checks their customers couldn't print. The point is that the entire supply chain needs to be considered.
I work at Veritas and I always have trouble explaining the need for good backup/recovery and DR solutions to my young friends. This is a nice link to send them in the future. It hurts to see this happen to a group of people, but hopefully it will lead to others testing (or implementing :) ) their strategies!
Nobody ever test-restored a backup.
That is a step too often skipped, because you don't want your test to overwrite live data, so you would temporarily need as much space elsewhere as the restoration takes. In fact, you'd better have a complete spare system to prove you can get everything working from the backup. That may be difficult to arrange.
The issue with a real restore from backup is that when an issue strikes you try these things:
1) Attempt to fix the issue as you don't want to load stale data or risk overwriting your live system, you think it'll take 1 hour but you've invested so much effort that after 6/12/24 hours you are almost there but keep hitting snags but are close enough not to turn to the backup (all the while every half hour you get a call asking for updates)
2) Eventually you decide you'll have to go to backup - the data is now even older. You start the restore with very little idea how long it will take, but estimate 1 hour. As it starts restoring and the progress bar whizzes along and tells you 2 hours to go, you feel hopeful. The progress bar gradually slows down and the time starts showing 2 hours, 6 hours, 12 hours to go. After 4 hours you think something is wrong with the restore, so you abort it and decide to manually copy the files over and restore from a local mount point. Repeat the issue above, with copy times getting gradually longer. You start looking at jumbo frames and data transfer graphs, and can proficiently convert bits to bytes to MBs to MiBs in your head.
3) You update everyone that it is actually going to take about 24 hours to recover that much data and go home, waking every 30 minutes to remote in and check progress.
4) After three days the few TB has copied back but the restore fails with an error 00x0ffx0f00075844 Unspecified Error. Possibly something to do with merging the incrementals.
5) You reach for your last full backup, data that is now 7 days old. You wait three days for a full restore again. The restore is a success but none of your DB-based products work. SQL won't start, Exchange won't mount any DBs; you've got files restored but a number of iSCSI links are broken on some apps.
6) You spend the next hour/6 hours/3 days trying to work out how to cleanly mount a DB with partially corrupted logs, spending more time on various forums than you care for, and eventually bring most things back to life - although Jim from finance still can't access any e-mails, the finance system has rolled back a month of transactions due to consolidation errors, and no-one can access the intranet anymore.
7) You feel relieved to have made it, single-handedly, through what feels like a war zone, but then feel angry that you had no real support and no-one seems to understand the mix of emotions - from anxiety, to panic, to fear, to relief and back to helplessness - you have just been through.
8) You go to speak to your manager and they tell you they are bringing in a consultant.
9) You get fired.
>a lot of schadenfreude right now.
Really? You're deriving pleasure out of seeing this failure? How very cold-hearted of you.
Any time I read these kind of stories I'm filled with relief that I'm not in the team that has to fix the mess, and have nothing but sympathy for them. We've all experienced screw ups like this, we all know what a stressful experience they are.
Yes, they messed up and should have tested their backups. But I take no pleasure out of seeing others' work go tits-up or lost.
It sounds like the sysadmin did not have a proper plan, was probably tired, preoccupied or bored, and rushed things; and someone had not put in place enough script logging/failure alerts and backup verification logging/alerts, or regular checks of both, to ensure the backups actually worked...
I do wonder why the database wasn't mirrored to reduce downtime for upgrades or other failures like this. Some redundancy should be compulsory for all professional systems.
Maybe GitLab could use OpenZFS with regular dataset snapshots, for rapid rollback to a snapshot taken before damage/deletion of files/data occurred, or (read-only) mount the snapshot and use some files/data off it to fix the active filesystem or make consistent backups; I've found the latter very useful when I've accidentally deleted stuff, e.g. on FreeNAS.
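For anyone unfamiliar with the workflow, it looks roughly like this (the pool/dataset name tank/pgdata is made up, and a snapshot of a running database is only crash-consistent, so it complements rather than replaces proper backups):

```
# Hypothetical dataset; adjust for your own pool layout.
zfs snapshot tank/pgdata@before-maintenance    # instant copy-on-write snapshot
zfs list -t snapshot -r tank/pgdata            # see what snapshots exist

# Read-only access to the snapshot's contents without touching live data:
ls /tank/pgdata/.zfs/snapshot/before-maintenance/

# Roll the whole dataset back if things go wrong
# (-r also destroys any snapshots newer than the target):
zfs rollback -r tank/pgdata@before-maintenance
```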
"Some redundancy should be compulsory for all professional systems."
Ya, but who's going to be the adult in this situation? Obviously one was lacking.
Regardless of how cloudy, trendy, and hipstery your company is, hire at least one adult. The one who knows the hard questions, and will ask them.
hire at least one adult. The one who knows the hard questions, and will ask them.
It's not enough to be able to ask the hard questions if you can't come up with some of the answers too. Sniping from the sidelines is easy, but being able to assist in making sure the answers are right is where the skill lies - it's the old adage: "don't bring me problems, bring me solutions".
IT is hard.
Backups are a pain in the ass, for exactly the reasons mentioned here. All ye who apply a rigorous and robust backup policy, I applaud you, but I doubt that a single one of my employer's clients falls into that category, and we have many, many clients.
Anyone know of a product that you can point at a database, provide with credentials, and it handles all the rest, including test restores with error messages on failures? That's not even getting into file backup, but file backup is notably simpler in many ways, especially with the right tools (ask any ZFS admin).
We used to have an EMC SAN with RecoverPoint that took an Oracle DB block-level clone once a night, broke the clone, remounted it, renamed it and replayed it so you could use it in the morning. Still not all the way there, though, as that didn't cover creating offsite backups that you knew worked - just that we could happily restore to any point during the last 24 hours.
There's a lot of pontificating here, but I wonder how many have really battle-tested their backups enough to be so sure.
That means there shouldn't be much real data loss at the final count, as the important things pushed there will have a backup in the place that pushed the last commit.
Pain in the proverbial for all the project leads to push up all the lost branch tips again, but at least it's only mostly dead and not dead-dead.
It's pretty easy to screw up even when you have a working backup.
Recently we had a situation where a client had messed up a data upload, and we needed to fix the data. Our software being fairly forgiving, the plan was to download the automated backup, then back up the live dataset, attach the downloaded files alongside the live system database, and repair the damaged data with the values from the backup.
It would have worked great, except that in between this work and our last test, the backup software vendor had changed the recovery process to include a helpful feature: namely, if you don't expand the file tree fully, it downloads the files and then automatically and silently overwrites the live database...
Needless to say, the client lost the two days' worth of data that we'd fully expected to be preserved.
We've rewritten the procedures to move the manual backup to before downloading the files, but even with the best-laid plans, backup software can be a tricky proposition!
I'm not that thrilled about anything cloud based and prefer to host my own repositories. And here's one of the many reasons why. For starters: I actually check my backups on a regular basis, even when I don't need them.
I'm not even going to bother commenting any further because this is simply too big a fail. Makes you wonder what kind of geniuses work there. And what they're doing all day.
There is little excuse for being unable to restore from backup. "The dog ate my backup tape.", doesn't cut it. Having to deal with a full-scale civil insurrection that has trashed your off-site storage location might give you a pass.
The thing is, it is human nature to put something off that is difficult that has no immediately apparent consequences, so you need a martinet in charge of backup and restoration services, because your business could well live or die as a result.
I worked for an organisation that did annual full-scale disaster recovery exercises, which were instructive. The chap in charge of them was quite sanguine about failures: his point was that he far preferred to find things didn't work during an exercise than during the real thing. As this is about backup, I won't go into some of the more interesting organisational failure modes we found, but the backups caused problems.
1) The off-site storage vendor couldn't locate some of the backup tapes. It turned out that nobody had ever audited their retrieval performance. Talking to the operators, it was found that it was common for tapes to go missing. All the operators did was use a new blank tape when it was time in the backup cycle to re-use one that couldn't be found. The organisation used enough tapes in day-to-day work that the small number of new tapes used to replace the missing backup tapes wasn't noticed. The operators' job was to do backups, which they were doing.
2) At least one of the tape heads was misaligned. It would quite happily write out a backup, which could be read with no problems on that drive, and that drive only. When the time came to ship the backup tape to the disaster-recovery location, no tape drive there could read tapes written by the original drive. Lesson learned was to make sure you could read backups on different equipment than it was made from.
3) It turned out that referential integrity is important. Who knew? Backing up files while they were in use, then trying to use the restored backup caused all sorts of problems. This was before the days of journalling file systems and snapshots. The application developers had failed to appreciate that backing up a large file took an appreciable amount of time, and in that time many records would be changed. An update that added or modified records both near the beginning and the end of the file would end up with only some of the updates recorded on the backup. That was solved initially by having a backup window where no updates were allowed to the file.
These days, some of the problems are solved by backing up to 'the cloud' and having the ability to snapshot databases, but doing proper backups is a neglected art. There is still a point at which it is not economic to back up over a network and you have to plan for moving physical media about, and then life starts to get interesting.
Doing backups and testing restoration procedures is a huge time hog, and it isn't sexy. But it is important. I hope these guys get themselves out of the hole they just dug for themselves. Some people probably have unhealthy stress levels right now.
I worked for an organisation that did annual full-scale disaster recovery exercises, which were instructive. The chap in charge of them was quite sanguine about failures: his point was that he far preferred to find things didn't work during an exercise than during the real thing.
People don't appreciate that failures are a wonderful learning experience. In my line of work, I've learned a lot more from unpicking a failure than working on a fault-free system.
I've also heard several instructors across different areas say that they often prefer pupils who appear to make lots of mistakes as the pupils learn a lot more from the mistakes than those who do things right every time.
When the time came to ship the backup tape to the disaster-recovery location, no tape drive there could read tapes written by the original drive.
I have also seen this with optical media - readable (probably just) on the original drive, not on another. Probably not after several years either.
As you mention, snapshots are a brilliant idea - instant copy of a whole file system for backing up so (mostly) no inconsistencies, and with copy-on-write like ZFS you only need space for the changes so having many per day is not a high cost. However, as you mention in some cases the on-disk file is not always in a consistent state when a process is using it so having time to do a snapshot with no modifications is also good.
"t turned out that referential integrity is important. Who knew? Backing up files while they were in use, then trying to use the restored backup caused all sorts of problems. This was before the days of journalling file systems and snapshots."
Don't roll your own encryption and don't roll your own database.
This was a solved problem years ago without depending on journalling file systems and snapshots. Use a proper database engine that does this for you.
"This was before the days of journalling file systems and snapshots."
Don't roll your own encryption and don't roll your own database.
This was a solved problem years ago without depending on journalling file systems and snapshots. Use a proper database engine that does this for you.
You are, of course, completely right. However...
The system used multiple files that were a kind of ISAM-type file*, and had been optimised to hell and back. It did its job, and fast. Several attempts, lasting many years each with large teams, were made to replace it with a 'proper DB', all of which failed to achieve the necessary performance. Half the application operated for decades** before being replaced by an entirely different system; the other half is still going, although its functionality is gradually being replaced by other systems, so eventually it will be sufficiently obsolete to decommission.
I had a lot of conversations with the DBAs of the proper databases also in use within the organisation, and the solutions proposed involved throwing a great deal of very expensive hardware at the problem. The business had a very simple question: "Why do we need to spend N times more money on the proper DB to achieve exactly what we are doing now for far less?". Having a backup window during which updates were blocked was a pragmatic (and thankfully, workable) solution.
Times have changed a great deal, and what was once very expensive hardware is now available very cheaply - Gigabytes of fast RAM, much faster processors with multiple cores, huge RAM-disks, Terabytes of spinning rust (and now, SSDs), so if you were starting again, you would simply throw enough (relatively) cheap hardware at the problem so you could run one of the newer databases. It wasn't an option then.
It's quite interesting how, once a system is up and running, it is often cheaper to continue with it than to build a more modern replacement - until a compelling event occurs - and as a result, you can find some remarkably old business-critical applications and systems in use. When you find you are forced to buy your spares on eBay, that is probably a good signal that moving to a newer approach is a good idea. That doesn't stop some people, though.
Sorry for mansplaining. Please don't take this as criticism.
*I'm deliberately not going into detail so it is not identifiable
**I'm carefully not saying exactly how long.
Anonymous just in case anybody might stand a chance of recognising the companies involved
I had 6 months of easy but boring contract work in 1999 for a company that had the foresight to have their backups audited. They discovered that none of their off-site backups were valid. The process was that backups would be made to on-site tapes, then these would be cloned to off-site tapes. The problem was that the backup window was too small and the clone job was set as low priority.
The on-site backup was working fine but the clone jobs never got to finish because the tape drives were re-allocated to higher priority jobs.
I got to re-jig the backup, keeping 2 of the drives under manual control so that I could check that the clone jobs were all finished before releasing the drives to other tasks.
Then we recalled all of the off-site tapes and I had the great job (when the drives were less busy in the afternoon) of feeding batches of the off-site tapes into the library to redo all of the clone jobs manually.
It was shocking how many of the tapes arrived back from the vault company with physical damage and even some which could not be found at all.
Some years later at another job, the same vault company was used. I made myself very unpopular with them by regularly requesting a random tape to make sure that it was available, intact and readable. There was a provision that in emergency they would have any tape available for physical collection within 30 minutes. As I had to virtually drive past the vault on the way to one of our datacentres, I would sometimes drop in on the way and request a tape. The receptionist would silently moan when I walked in after a while.
This was no guarantee that the backups would work, but I had the reassurance that we were taking steps to maximise the chances.
The contract with our customer called for annual DR tests. In 8 years they agreed to one very limited test. They did not want to disrupt their important business activities.
> it's a bit hard to understand use of rm -rf ... at the command line?
Has nobody ever done "rm -rf . /tempfolder"? One typo, and hell to pay if you don't notice it.
As a junior SA I once did that on the root box of the SAN, in the company root directory. I blew away the entire company's data at 6pm, when I was tired and in a bit of a rush to go home. I'd wanted to clear out some temporary dirs I'd created for testing. I hit enter and started packing to leave. Only when the monitoring went haywire did I log in again and realise what I'd just done. Hundreds of millions of files across god knows how many divisions were deleted.
Spent all night till 3am restoring everything, and writing scripts to pull fresh data, and then setting the permissions just right.
The next day only 5 people noticed discrepancies in their data, which was a phenomenal result, but it was a life lesson as well. Despite managing to recover almost all the data, I was "asked to leave" shortly after (can't say I blame them).
Thank god for ZFS snapshots and a verified backup system, otherwise the company could well have ended up having to cease trading. Or at least losing untold millions and millions before they could start to function again.
I also am really really careful around "rm -rf" commands as root on machines now.
"Nobody every done "rm -rf . /tempfolder"?" -- Ogi
30 long years ago I over-lingered on the SHIFT key, rm -r *.o became rm -r *>o and left me with a single file containing a single byte. Now I usually put an -i in, and when it seems to be right, exit and edit the command line to remove just the -i before setting it off in anger.
But for specific critical folders you could use cp -al DoomedFolder/ QuickSnapshot/ ... Now, QuickSnapshot contains hardlinks to all the files and folders in DoomedFolder but because you haven't copied anything you don't need all that much space (just a bit for the new inodes) to do it (or much time to execute it). Now you can rm -r DoomedFolder and you've still got a second chance.
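Spelled out, the trick looks something like this (the folder names are just placeholders, and note the caveat):

```
# Hard-link "quick snapshot" as described above (GNU cp).
cp -al DoomedFolder/ QuickSnapshot/   # link rather than copy: near-instant, tiny space cost
rm -r DoomedFolder                    # the file contents survive via QuickSnapshot's links

# Caveat: hard links only protect against deletion. A program that modifies
# a file in place changes the "snapshot" too, so this is a convenience,
# not a substitute for a real backup.
```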
I was suggesting the use of a shell script. Perhaps I should have been more explicit?
Quite apart from that I've written my own 'del' command using 'mv' instead of 'rm' for use at the command line.
Creating such a script is left as an exercise for the reader. Write on only one side of the intertubes.
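For anyone who doesn't fancy the exercise, a minimal sketch of such a wrapper might be (the trash location is arbitrary):

```
#!/bin/sh
# del - hypothetical rm replacement that moves files into a dated trash
# directory instead of deleting them, so mistakes can be undone.
TRASH="$HOME/.trash/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$TRASH" || exit 1
mv -- "$@" "$TRASH"/ && echo "moved to $TRASH - empty it when you're sure"
```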
"Quite apart from that I've written my own 'del' command using 'mv' instead of 'rm' for use at the command line."
I've had an accident with mv and managed to move /bin further down the hierarchy. It should have been possible to recover by booting from the distribution disks but the vendor had omitted the driver disk for the SCSI controller. It was the following afternoon when the controller vendor finally emailed us a driver.
Reminds me of the time a colleague of mine, a few years back, was making some changes to an MQ submission script that simply polled directories defined in a configuration file, pushed the files onto the queue, and deleted the files afterwards (they should already have been archived by that point in the process).
It took them a while to notice what was going on, but the system basically ended up eating itself! Submitting all the files it was meant to, then working back up the directory tree, and submitting and deleting the next directories contents it found, and so on.
Thankfully permissions meant the process could only really 'eat' data and a few test scripts, rather than system files, and this wasn't a production environment, although it was in use for UAT at the time! Cue questions from the client: "My data's gone, but I can't find it where it should be! Does anyone know where it went?"...
That must have been a very painful document to write, but it's a great real-life scenario and a future test case. How many people have screwed up backups and kept quiet and vague about it, for operational reasons or pride?
Hopefully, someone will learn a lesson from this, but I won't hold my breath.
I've done this. Ran an "rm -rf" command on a production server due to an email system creating a huge log file that brought the whole server down. Early in the morning, noisy open-plan office, I run rm -rf forgetting I'm in the root directory. It wasn't until Linux started saying "/boot/ could not be removed" and I thought "Why is there a /boot/ directory in this folder?" that I realised. I cancelled what I could, but it was too late.
The server was off for 36 hours because the great guys at Rackspace tried to restore a 120GB backup to the 60GB drive that was unaffected by my mistake.
But hey, it was the first time in my career I did that and so far it's been the last time.
Back in the days before package management I was upgrading some libraries including ld.so - the dynamic library loading library. I moved or deleted the old one, and the next command to run was "mv newlibrary.so ld.so". But of course "mv", along with every other command on the OS, was dynamically linked. It didn't end well, although I did learn my lesson.
I'm sure they'll get all the stick, there are certainly failings
But management often tends to not be so interested in DR, until something like this happens. Especially in companies who are running to just keep up with constrained resources.
I have seen, many times, IT departments wanting to test DR while management will not provide the resources (equipment and/or staffing) to do so. Nor will they accept any interruption to production systems to test a DR solution.
#RMRFocalypse.
Pretty bad, but hopefully they will learn from this £xp£ri£n$e and test their backups properly.
I've lost data before but not quite on this scale, lesson learned to lock your PC especially when some random work experience drone comes along with his 2 friends and goes "Oh lookie, a hex editor"... Facepalm!!!
The only way to be confident in your backup plan is to have tests to make sure it's working.
If you back up nightly, you could automate grabbing the latest backup, restoring it to a throwaway instance, and ensuring that it completed properly by checking record counts in various tables. You could run that every other day - or better yet, run it as soon as your backup process has completed, to verify that it has indeed worked properly.
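As a rough illustration of that idea (the database names, the table and the row-count threshold are all placeholders):

```
#!/bin/sh
# Hypothetical nightly restore drill: load the latest dump into a scratch
# database and sanity-check a key table's row count.
LATEST=$(ls -t /backups/*.dump | head -1)

dropdb --if-exists restore_test
createdb restore_test
pg_restore -d restore_test "$LATEST" || { echo "restore FAILED" >&2; exit 1; }

ROWS=$(psql -At -d restore_test -c "SELECT count(*) FROM projects;")
if [ "$ROWS" -lt 100000 ]; then
    echo "restore completed but 'projects' has only $ROWS rows" >&2
    exit 1
fi
echo "restore drill OK: $ROWS rows in projects"
```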
You could still get caught out in many ways but verification to some extent would give you more confidence.
I can understand how this has happened, though; start-ups are not the same as large corporations with the resources to have people spend a long time ensuring backups are rock solid and testing disaster recovery monthly, etc. In an ideal world that'd be quite high on the agenda, but realistically, breaking even is the first hurdle and you don't (technically) need a backup plan for that, so it gets put to the bottom of the list.
That's if you can afford an instance or some other failover. Many CAN'T. Yes, it's stupid, but if you're stuck in the middle of the ocean with nothing but a piece of flotsam, what options do you have besides exhausting yourself treading water?
As said, breaking even is priority one because you're obligated to your investors first. If they don't agree with you about long-term investments, then again you're stuck, because they can pull out, killing you BEFORE the disaster hits.
Yesterday,
All those backups seemed a waste of pay.
Now my database has gone away.
Oh I believe in yesterday.
Suddenly,
There's not half the files there used to be,
And there's a milestone hanging over me
The system crashed so suddenly.
I pushed something wrong
What it was I could not say.
Now all my data's gone
and I long for yesterday-ay-ay-ay.
Yesterday,
The need for back-ups seemed so far away.
I knew my data was all here to stay
Now I believe in yesterday.
So it seems:
1) They are not using Barman for backup management in Postgres.
2) They thought they'd be able to VACUUM (FULL, otherwise the space reclaim they needed wasn't happening).
3) They don't have PITR ready to be used.
This is how IT works today: "We need a database". "Sure". <after googling> "apt-get install postgresql". "Done". (Sometimes you even find a Docker image ready...)
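For contrast, even without Barman, a bare-bones PITR setup is only a few lines of configuration plus a base backup. A sketch (9.x-era syntax; the archive path is a placeholder and should really point at remote storage):

```
# postgresql.conf - minimal continuous WAL archiving
wal_level = replica          # 'hot_standby' on 9.5 and earlier
archive_mode = on
archive_command = 'test ! -f /archive/wal/%f && cp %p /archive/wal/%f'

# Then take a base backup to pair with the archived WAL, e.g.:
#   pg_basebackup -D /archive/base/$(date +%F) -Ft -z -P
# Restoring that base backup and replaying WAL up to a recovery_target_time
# gives you point-in-time recovery.
```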
I created typo-geddon once using chown.
As root, I tried to give my user ownership of files from my current dir down (./) and instead put in the space of doom and changed ownership from root on down.
I realised the command was taking too long after about 3 seconds and hit ctrl-c.
I genuinely thought I'd got away with it at first until the machine had to reboot, and a lot of the fundamental stuff that ran the machine (IBM AIX) got read from disk again.
The box itself could be done without for a while so my boss at the time got me to mount the drives and undo a lot of the damage manually to reinforce the lesson (which I've never, yet, had to relearn).
1. - They were honest
2. - They are now live streaming the recovery process via YouTube.
https://www.youtube.com/watch?v=nc0hPGerSd4
A new gold standard in dealing with customers after a mammoth fuck-up, if you ask me.
Still - it shouldn't have happened in the first place, but compare this to other companies and I'm not sure I could ask for more.
*Not a GitLab customer
This is just one of those horribly ugly situations. Feel for the SA that hit the wrong command in the wrong place at the wrong time. We are all human and I'm sure anyone that's been in the business long enough has done something near enough to identical to feel for this individual.
I've found that being in that place, *THE* most critical thing to do right then and there is stand up and tell those that absolutely need to know that you've buggered up. And if you know what you can do to recover, lay out those options (Please, note the plural there, you should have more than one option). Otherwise bellow for assistance. Seems this SA at least hit that set of rules.
Backups. Snapshots. Copies. etc.
They can *all* fail at different times for different reasons.
This sounds to me like a case of too many disconnects between groups as to which is what and who owns what.
I've written DR plans. I've executed them. I've audited DR execution. I've fixed DR plans after the test. I've tried. Really I have tried. But unless your DR process is part of your day to day execution those plans get to be crap every 6 months or so since the apps and systems you're restoring change pretty damned rapidly nowadays.
Now, I'm gonna go back to trying to figure out why 6 tape drives on a sun box have crossed up data and control path device files.
Given the recovery process, it looks like the DBA is pretty competent - he may have made a huge mistake (in my experience, competent people can hugely misjudge the risk of their actions) and is fixing it. Not a perfect fix, but able to recover all but 6 hours of data and able to quantify what was missing in under a day isn't bad given the number of issues found.
It looks like the root cause was attempting to get replication working from live to staging, which broke the db1 to db2 replication process - the issue may have been related to performance limits in the staging environment. There was then a period of high DB utilisation that may have partially contributed to the replication problem, either directly or indirectly by distracting the DBA. While I can understand the thought process behind deleting the db2 replica and starting again, there was a risk in those actions that was unfortunately realised. At which point, things started to go horribly wrong as all the backup issues were discovered.
The bit that is missing is why did all the backups fail? I suspect the backups and backup process had been tested in the past with the earlier DB versions. 9.6 is reasonably new (Sept 2016) so they may have had a working backup strategy up until at least then and arguably based on their issue tracker until mid-December 2016.
Why is this important? Read through the comments about testing backups and ensuring high availability. They probably had both until last month when they upgraded the database...
@theblackhand I was going to post much the same.
The backup plan was so broad-reaching that it is very unlikely it was never tested.
The article includes a bit about using outdated versions which "failed silently".
My suspicion is that the backup strategy was tested so comprehensively and had so many fail safes that everyone assumed that they were covered and neglected to check on a regular basis because it was "too good to fail".
All those posting that it was obviously never tested; reveal your position as an insider or other verifiable proof or STFU.
It seems like their setup was rather fragile. I'd put my money on not having enough geek horsepower to do everything they wanted to do. I've been in that situation many times. Even after a near disaster with lots of data loss (and close to a week of downtime on backend systems), the company at the time approved the DR budget, only to have management take the budget away and divert it to another underfunded project (I left the company weeks later).
One place I was at had a DR plan, and paid the vendor $30k a month. They knew even before the plan was signed that it would NEVER EVER WORK. It depended on using tractor trailers filled with servers, and having a place to park them and hook up to the interwebs. We had no place to send them (the place the company wanted to send them to flat out said NO WAY would they allow us to do that). We had a major outage there with data loss (maybe 18 months before that DR project); they were cutting costs by invalidating their Oracle backups every night to use them for reporting/BI. So when the one and only DB server went out (storage outage) and lost data, they had a hell of a time restoring the bits of data that were corrupted from the backups, because the only copy of the DB was invalidated by opening it read-write for reporting every night (they knew this in advance, it wasn't a surprise). ~36 hrs of hard downtime there, and they still had to take random outages to recover from data loss every now and then for at least a year or two afterwards. Never once tested the backups (and the only thing that was backed up was the Oracle DB, not the other DBs, or web servers etc). Ops staff so overworked and understaffed, major outages constantly because of bad application design.
Years later after I left I sent a message to one of my former team mates and asked him how things were going, they had moved to a new set of data centers. His response was something like "we're 4 hours into downtime on our 4 nines cluster/datacenter/production environment" (or was it 5 nines I forget).
I've never been at a place where even, say, annual tests of backups were done. Never time or resources to do it. I have high confidence that the backups I have today are good, but less confidence that everything that needs to be backed up is being backed up, because in the past 5 years I am the only one that looks into that stuff (and I'm not on a team of one); nobody else seems to care enough to do anything about it. Lack of staffing, too few people doing too many things... typical, I suppose, but it means there are gaps. Management has been aware, as I have been yelling about the topic for almost 2 years, yet little has been done. Though progress is now being made, ever so slowly.
At the place that had a week of downtime, we did have a formal backup project to make sure everything that was important was backed up (there was far too much data to back up everything, and not enough hardware to handle it, and much of it was not critical). So when we had the big outage, sure enough people came to me asking to restore things. In most cases I could do it. In some cases the data wasn't there - because - you guessed it - they never said it should be backed up in the first place.
I've been close to leaving my current position probably a half dozen times in the past year over things like that (backups are just a small part of the issue, and not what has kept me up at night on occasion).
I had one manager 16 years ago say he used to delete shit randomly and ask me to restore just to test the backups (they always worked). That was a really small shop with a very simple setup. He didn't tell me he was deleting shit randomly until years later.
It could be the geeks' fault though. As a senior geek myself I have to put more faith in the geeks and less in the management.
I am reading a lot of sanctimonious comments from people explaining how this could never happen to them because they always test everything and are well-prepared for a failure event. I'd like the people making those comments to honestly answer the following questions:
1) Do you presently have a spare can of fuel in your car?
2) Do you have a spare can of water in your car?
3) Do you have a torch (flashlight) in your car?
4) Do you carry warm clothes and/or blankets (in case you get stuck in a traffic jam etc. overnight)?
5) How regularly do you check the air pressure in your spare tyre?
6) When did you last check that your brake lights were working OK?
Yes, it is, but sometimes you need to understand the risks of doing too much.
Sometimes you need to just rely on your design and, after proving you have made it as good as possible, let it go. One example of this is the ascent stage of the lunar lander. This rocket was only fired ONCE, for the takeoff from the moon. It was NEVER tested, since the act of testing it with the fuels/oxidizers involved degrades/destroys the engine itself. They built it to be as bulletproof as it could be and over-engineered it a bit more. It used a hypergolic fuel mixture and simplified fuel flows (I believe they used gas pressure to empty the tanks), and it had only one speed (ON!). Guess what, it worked EVERY time. As for the vehicle that I use every day:
1) Do you presently have a spare can of fuel in your car?
No, but I do watch my gas gauge, and if I forget, I have an AAA (US; AA in the UK) card that will get me some.
2) Do you have a spare can of water in your car?
No, but the one time the cooling system failed (it was a couple of months ago), I could pull over and park while waiting for a tow.
3) Do you have a torch (flashlight) in your car?
Yes, it is only common sense. This is a small device that takes up little space, and has other benefits.
4) Do you carry warm clothes and/or blankets (in case you get stuck in a traffic jam etc. overnight)?
No, but in the cases where this might be a problem, I was traveling to a ski area overnight, and DID have some warm clothes I was actually wearing.
5) How regularly do you check the air pressure in your spare tyre?
While not on my vehicle, automatic pressure telemetry is now required on new vehicles. I do get my tires rotated on a regular basis (5,000 miles) and it is checked there.
6) When did you last check that your brake lights were working OK?
Thankfully the vehicle's electronics DO check this (modern cars!). As for older vehicles, no brake lights will usually get you rude warnings (horn honks) from people behind you. Good practice to check every so often, when servicing.
So while you do bring up valid points, overthinking things like this can get too extreme. Thankfully the faults described do not cause my vehicle to spontaneously destroy itself, whereas lack of a proper computer backup can be catastrophic (to say the least).
7) do you have all of the above and a SPARE CAR?
8) do you keep the spare car's engine running (or at least run it a few hours per day), drive it a few miles, and keep it fuelled all the time? (= live backup system, just in case...)
9) do you have all of the above in a third spare car that's kept running, fuelled and road-worthy all the time on the other side of the continent? (= live backup data being kept in multiple locations)
and so on... the logistics of these things keeps getting more complex.
Maybe if rm is invoked as root (maybe by any user?) with '-rf' arguments, it should count the number of files it might delete, and say:
Wow over 1000 files, are you sure?
Me? Typically I do it without the 'f' option and see how it progresses, then abort and re-do with the added '-f' option as needed. I get very careful with recursive descents (with good reason!).
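Since rm itself won't do this, a wrapper along those lines is easy to sketch (the threshold, and the assumption that the arguments are plain paths rather than options, are mine):

```
#!/bin/sh
# Hypothetical "careful rm": count what a recursive delete would hit
# and ask for confirmation above a threshold before handing off to rm.
THRESHOLD=1000
COUNT=$(find "$@" 2>/dev/null | wc -l)
if [ "$COUNT" -gt "$THRESHOLD" ]; then
    printf 'Wow, %s files would be deleted. Are you sure? [y/N] ' "$COUNT"
    read answer
    [ "$answer" = "y" ] || { echo "aborted"; exit 1; }
fi
exec rm -rf -- "$@"
```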
I don't have a company, and I have backups of my backups. I never know when a hard drive will fail. I only back up important stuff that I cannot replace from elsewhere.
I would like to have a second or third copy stored elsewhere, but I have a limited budget at the moment. I'll just work with what I've got for now.
As for the company in question: I think lack of experience leads to this type of error, resulting in large-scale problems like this one. Also, not paying enough attention in school when people learn about computers and how they actually work.
Have you tested the backup of your backups? Backing up a corrupt/bad backup will get you exactly where gitlab is. This story is pretty funny though. I'd expect better from a company with their name recognition. That rm -rf that was mistakenly run as a part of a replication process is one of the main reasons for automation. Also can't stop laughing at "The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented" and "Our backups to S3 apparently don’t work either: the bucket is empty"
I'm reminded of someone who made a mistake that cost a company a large amount of money.
The person concerned was called to the CEO's office.
"I suppose that you want me to leave the company" he said shame-facedly.
"Leave? We just spent over $1 million on your education. Just don't do it again!"
GitHub (last time I looked, months ago) hosts backup programs that most likely would have worked on the GitLab database....
*************************************************************
Shoot, disregard this - I forgot they were separate entities.
Left my stupid comment up to make this comment make more sense.