I'm not a DBA...
but even I got to the line that started DROP and I was squirming.
A reminder of the devastation a simple DROP can cause, and that backups truly are a DBA's best friend, in this morning's "there but for the grace of..." Who, Me? "Stephen" is the author of today's confession and was faced with what should have been a simple case of applying an update to an Estimating and Invoicing system. The …
Same here... "Drop the database" elicited an "oh no... surely you meant *stop*" from me.
Thankfully, we keep the raw logs and transactional stuff as a backup for longer than the actual database, so we can roll through a year's worth of stuff to recover, but I don't ever try the 'DROP' keyword, ever.
Oh yes!
As a 25-year Oracle DBA veteran, seeing the words "drop", "truncate" and even "alter" is enough to make me automatically take my hands 6" away from the keyboard. I've made my share of "blood draining from face" type screw-ups over the years that deserve the icon shown, and I seem to have trained myself to instinctively take my hands off the keyboard at certain words or situations.
I did assume "drop the database" meant down the service, not literally DBA speak "drop database...", but there you go, the English language is a funny beast!
I'd say this is an object lesson for companies to ensure that they hire staff specifically for the purpose of controlling, maintaining and managing their hardware, software and data.
An Administrator of Systems if you will.
(Yes, I've been caught in the "you're technical, you can handle this" catch22 before now. Never again, thanks).
I've said it before and will probably be saying it on my death bed, but many (most?) companies only notice system admins when they get something wrong. If you're doing what you should do, all you get is "just what do you do all day?" (or even worse, "why do we pay your salary?").
What is needed is some way to ensure a critical system "goes catastrophically wrong" every now and then, only to be fixed (after a suitable delay) by the heroic sysadmin who saves the day, without the users ever catching on to what's really happening.
"What is needed is some way to ensure a critical system "goes catastrophically wrong" every now and then, only to be fixed (after a suitable delay) by the heroic sysadmin who saves the day, without the users ever catching on to what's really happening."
Yes. A market niche Microsoft have been taking care of for us for a few decades now...
The most ironic aspect is the existence of that one admin we all know. That one who cleverly architects catastrophes such that they never appear to be at fault but somehow are always swooping in at the critical hour to orchestrate a heroic fix.
Meanwhile, behind the scenes, the best people design and administer systems such that the fire never happens in the first place. And they never get credit, because nobody knows that they're doing God's work day in and day out to keep everything nice and boring.
DROP...
Deltree *.*
Kill "C:\" (a VBA classic)
rm -rf
:(){:|:&};: (please don't use this on something that you don't regard as expendable / rebootable).
Among other commands that exist for legitimate reasons but are very, very easy to misuse. Deltree was particularly vicious, seeing as it doesn't operate from the currently selected directory, but rather from the drive selection.
Actually it might be more readable like this:
:()
{
:|: &
};
:
The mean joke is declaring a function named ":". The rest is rather self-explanatory: call ":" from within ":", pipe it through another ":", and fork the pipeline into the background with "&"; the final ":" kicks the whole bloody mess off. The function never usefully returns - each invocation just spawns two more copies of itself, so the number of processes doubles until the machine chokes.
Not much use tracing it because the machine will lock up completely as soon as you hit RETURN.
Well, the core of it is that in most computers, the buffer is basically a "to do" list for the currently running software. In these attacks, the buffer contents might be something like:
* An instruction saying, maybe, "print the next thing"
* then a tag saying the next 16 bytes are a single string,
* then those 16 bytes,
* then there will be the next thing to do after you've printed that string.
The trick is that if you somehow put _17_ bytes in that 16-byte string, it overwrites the next byte as well.
The computer will then try to do the next instruction... which the attack just replaced, and there's your entry point.
Not all computers, and not all programs, will actually use the buffer in sequential (or otherwise predictable) order, but it sure is a lot faster if you do, and that's what creates the risk.
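If anyone wants to see the overwrite happen without smashing a real stack, here's a toy Python simulation of the layout described above (the byte values and names are mine, not from any real exploit):

# Toy "to do" buffer: [print opcode][length 16][16-byte string][next opcode]
buffer = bytearray(b"\x01" + b"\x10" + b"A" * 16 + b"\x02")

def write_string(buf, offset, data):
    # Deliberately no length check - this is the bug.
    buf[offset:offset + len(data)] = data

# The attacker supplies 17 bytes for the 16-byte field...
write_string(buffer, 2, b"B" * 16 + b"\x99")

# ...and the "next thing to do" byte is no longer 0x02 but the attacker's 0x99.
print(hex(buffer[-1]))   # prints 0x99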
A good explanation of this is "Smashing The Stack For Fun And Profit" from Phrack 49.
The technical details are slightly outdated (it's from 1996), but the basic concepts are as relevant as ever.
Then there's the optimal way to do the same thing with a fraction of the effort. The hacker way. Goes a little like this:
Download a gig of RAM tarbells from the cloud and zip them up. Run netstat on the RAM until the symlinks are immutable. Rinse and repeat. Notice every jpeg on the blockchain has been embedded as HTML bit flips on the virtual machine microservice partition.
Check your MySpace, and BAM, you have 1 million more friends than you did 5 minutes ago.
You're welcome.
I deal with all sorts of disks that need re-purposing for test builds, image deployment etc. As a result, I use Microsoft's diskpart.exe to 'Un-initialise' disks quite often...using its 'clean' operation.
I automate a lot of operations, but sometimes manually is the only way to do something. My code and comments are full of dire warnings about using 'clean', and the pre-execution checks are very robust, erring on the paranoid side of things:-)
There is an entry in our local 'wiki-ish thing' about re-initialising physical media and it's full of warnings like '[BE EXTRA SPECIAL BLOODY CAREFUL!]' and '[SERIOUSLY, BE REALLY CAREFUL - THERE IS NO WAY BACK FROM THIS]'.
So far we've been lucky and the warnings have worked...which is a shame from my POV as I'm in favour of some of the 'as fast as possible, ask no questions' types learning a really hard lesson sometimes - my rules are 'You fuck it up, you fix it' (Right up to the point where people fess up to not knowing how, at which point we do actually help - and educate - them. Two lessons for the price of one!).
Doing a 'detail disk' and reading the output is now more or less muscle memory for me:-) I'm just as cautious with re-initialising media on Linux...blkid is your friend..
After overwriting my main system's primary drive (instead of the intended USB stick target) with a Clonezilla ISO image, I put together a tower system which had pop-open hard drive trays, a tape unit, two CD/DVD units, and no internal hard drive.
This computer's sole purpose in life is to act as a host for data transfers, data recoveries, and hard drive testing. Lacking an internal hard drive, you boot it via CD/DVD or USB stick. Having a brain-fart while using dd, clonezilla, or badblocks on this PC is far less potentially disastrous than doing the same on one of my other computers.
RECOVER
Sounds safe, doesn't it? Just the sort of thing that should be the matching command for BACKUP under MS/DOS. What it actually does is delete any directory information, then reconstruct what files it can as FILE0000 to FILE0127 in the root directory. You had more than 128 files? Well now you haven't.
I'm not a DB person, but as an on-site engineer, I once turned up to a customer site where a DB had gone titsup. My first question was "Do you have a backup?" which was answered by someone proudly holding up a box of tapes. The next question was "How do we restore from those?" This was answered with embarrassed silence and looks of complete confusion.
Luckily the DB man was on his way and turned up to fix everything without needing the tapes.
A friend of mine worked as an IT consultant and told of a company he occasionally visited, where the head of IT would pick a random day to turn off a random machine and tell his team to recover, just to prove that their processes were correct. He must have had a lot of confidence (and balls of steel).
Back in my early days as a VMS operator, the IT director once came down into Ops central, marched onto the machine floor and boldly flipped the Big Red Switch that switched the power off to the entire data floor.
Cue clenched sphincters as we waited (and waited ... and waited) for the backup generators to kick in before the UPS died. Then they marched out and simply said, "We've had a power failure. Call DEC and put the disaster recovery plan into action."
This was, apparently, their way of conducting a full resilience test - no, DEC had not been pre-informed of the test either - as far as they were aware, it was a genuine disaster - and the recovery plan involved them trucking in duplicate hardware for all our key machines on what was effectively a mobile data centre. Must have cost [i]someone[/i] a hell of a lot of cash to put that thing into mobilisation.
Yes. And as a one-time member of DEC's so-called "flying squad", I'm here to tell you that we were often advised that, although it would be quite functional, and would lead to overall efficiency of IT operations at such sites, we were NOT allowed to throttle the IT Director. Yelling at the twat was allowed, however, which I'm sure helped minimize our blood pressure.
Can you imagine if companies provided the budget for something like this, instead of cutting everything to the bone in the service of Just-In-Time Lean Diets with zero slack space for anything going wrong? For one, the ransomware epidemic would simply not exist, everyone would just shrug and restore from backup like they did every few months.
Almost right. I once had to persuade management (and the web developer) after a bad SQL injection attack on our main website that it wasn't enough to simply restore the site: it had to stay offline while the failings in the code were fixed, otherwise we would be online for 15 minutes before crashing again (leaking all sorts of confidential information, and yours truly cleaning up the mess again). You can restore the data but you must fix the hole or the ship sinks again. Fortunately I succeeded, but only just.
I used to work at a bank, where there were regular disaster tests. Disaster was simulated by shutting one of the data centres down and checking if all services were still available as they should. Next time they'd take another data centre. Yes, in the daytime (though on a weekend when the securities exchange was closed).
That's just one of the reasons I felt proud to work there.
On a somewhat related note, I read a paper about a decade ago that advocated _against_ orderly shutdowns. The author's position was that systems should be architected in such a way that an uncontrolled shutdown not do any damage, and that to test that robustness guarantee, one should always shut it down hard -- and to ensure _that_, one shouldn't even create a clean-shutdown facility in the first place.
E.g. for a desktop O/S, that would mean no "shut down" menu option, command, or whatever; a hard power-off would be the correct, documented way to bring the system down.
(I've tried searching for that paper again, but without success. Can anyone point me at it, by any chance? I'm pretty sure it was an academic paper, not a blog post or the like.)
This is a bad idea, because it requires the programmers of every app you run to be perfect.
A successful clean shutdown requires more than file system consistency. It also requires application state consistency. If halfway through your database update run, the OS force-closes the app's files, then terminates the app, well yay, your filesystem is clean, but your app state is dirty. Yes, I've heard of checkpoints, and well-written apps make use of them, but don't expect the programmers of QuakeCrysisDukeofDuty to put in a superfast emergency gamesave feature. More mundanely, you will also lose some changes and text in that program you were editing. Going back to the last good save, and trying to remember all the edits you did since then, will be a painful and error-prone "recovery" method. Perhaps your editor keeps some sort of edit-transaction-log it can replay. Does your editor do that?
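For the editor case, the usual belt-and-braces trick is write-to-a-temp-file-then-rename, so a crash at any instant leaves either the old save or the new one, never a torn file. A minimal Python sketch of that idea (the file name is just illustrative):

import os, tempfile

def atomic_save(path, data):
    # Write to a temp file in the same directory, flush it to disk,
    # then atomically rename it over the old file.
    d = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)   # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp)
        raise

atomic_save("notes.txt", b"the edits you'd hate to retype\n")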
Back when I worked for Bigger Blue (late 1970s), on Fabian Way in Palo Alto, we'd kill the mains power at 3PM on the last Friday of every month to ensure that the battery would carry the load long enough for the genset to warm up enough to take over. In the event of failure, everyone went home early with two hours pay ... This last never happened while I worked there.
I briefly worked as a relief system admin while they recruited someone permanent in a company where the manager did that. It wasn't just turning a system off, he'd also sometimes just unplug random cables.
Fortunately, his idea was to test how the environment coped, and also to see how long it took us to identify what he'd done, in order to get it working again, rather than a full system restore test.
I seem to vaguely recall that back in the mists of time one of the big backup software houses released a version that produced corrupt backups, but this wasn't spotted for months, until people started to need to use those backups - only to find they had months of meticulously taken backups, none of which actually worked.
I once got called to a customer hardware problem. Job: replace the failed tape drive. Got there, confirmed the drive was faulty, swapped it out, set a test backup going. Failed. Tape beyond expiry date, system refuses to use it. Ask for another one. Same. Check their actual backups. Same. They'd expired about a year earlier.
Turns out the backup was scheduled to run at 9pm, so someone was tasked with swapping out the tape before leaving for the day. It went in ok, but instead of being ejected after the backup completed, it was rejected and ejected at 9pm when the backup process started and immediately aborted. When I reported this to their head IT admin, it turns out the emails were being sent to an account of someone who left two years ago and it was highly likely this was happening at every one of their 120 remote offices. Oops!
Let's not even bother with asking if they ever tested their backups :-)
Back in the day, I was ribbed by my cohorts at one installation for backing my programs up to paper media (punched tape, punched cards) instead of far more convenient magnetic tape. My thinking at the time was that punched tape readers -- every ASR-33 Teletype had one -- and card readers were far more prevalent and standardized than magnetic tape units.
But there was an aspect I had not thought of. This installation's budget was largely dependent on government grants, and there were a few lean years during which management deferred what previously had been regularly-scheduled hardware maintenance. Eventually we got another big chunk o' money, and the DEC hardware techies were scheduled to come in over the next few weekends. The first weekend they maintained our four TU77 tape drives, which included head alignment. Come Monday, jobs started failing due to tape read errors ... d'oh!
In the late 90's I was called to rescue an occupational health company that had all of their information (and when I say everything, I really mean everything: clients, contracts, test results, payroll, every last bit of business information they needed) in a single 700 MB Access file. A hardware failure crashed the .mdb and their most recent backup copy was over a month old. Unfortunately (for them), we were unable to restore it properly, only managing to salvage parts of tables' content, unlinked to anything else.
They ended up losing several contracts over the issue, but do you think they learned the lesson?
Well no, they rebuilt from the backup copy and manually inputted all the missing information from what we'd recovered and their paperwork, keeping everything else as it was....
In the days of 3340s - which you could physically pick up and mount/unmount (and which looked a bit like the Starship Enterprise), spinning disks etc.
One of our testers, who was an operator in a previous job, had had problems with the disk containing the master database for the bank's customers. He called over the senior operator who said.... we had better try it on a different drive in case the drive is suspect.
It didn't work there either - so it must be the disk. They got out the mother disk. Yesterday's database is copied to a different disk and the batch update run to make today's database (so the mother database begets today's database).
That didn't work either, so the senior op got out the grandmother disk from the manager's cupboard. Mounted it - and it didn't work either.
So they phoned the manager, who said "that's OK - just do not touch the grandmother disk".... "Ahhh, too late," came the response.
There had been a head crash on the original disk.
Mounting it on a different disk drive damaged the heads of the second disk drive.
The mother disk was corrupted by the damaged heads.
The grandmother disk was then damaged by the damaged heads.
Fortunately they had a copy of the database which was only a month old, and could reapply the overnight changes which took about a week to do.
And that's when the tester decided to join our company, where he could do less damage.
Yup, and on the Honeywell mainframe at my university. Two drives and three disk packs later...
Fortunately, I was nowhere near the machine room at the time.
Twice-daily tape backups were robust, so I don't suppose much data was lost.
(Perhaps as a result of that incident, but I'm not sure) one guy made a point of eyeballing disk packs before mounting them, looking for dust etc., and taught me to do the same. Not sure how effective that precaution might actually be, but it couldn't hurt...
Oh dear. That reminds me of an embarrassing moment in an open plan office. I was on the phone to a customer who managed to dig himself into a hole. It was a bad line and I had to speak loud. Suddenly, the entire office went quiet as everyone, by chance, stopped talking. It was into this silence that I heard myself bellow "Have you mounted your grandmother?".
It took about three seconds for the first chuckle to be heard, but within 15 seconds the entire office was filled with guffaws.
Yup, managed to do this twice where I am now and fessed up immediately. First time I was just a contractor and I reckoned I'd be looking for another job next day. Turns out I told exactly the right person, who then shouted over to another tech "Hey, you know that button in X that we should never press..." and all had a sigh of relief that it wasn't one of them that'd done it. They managed to get most of the data back and the rest was rebuilt over the next few weeks, but it was noted when I had my interview to go permanent that my handling of the incident was one of the reasons I was being made permanent.
My other memory of that interview was there was a poodle sitting in on it (yes, a real one - but that's a story for another time).
Second time I realised within a few seconds what I'd done (removed the access group instead of removing the user from it) and contacted the one person who I knew could do something about it ASAP - as a result no major fallout and another potential hole removed.
Stephen, whether he liked it or not, was, in effect, the DBA for the system. After all, there was nobody else in that role. That applies to anyone else in that situation.
In that situation he needed to acquire the two essentials. No, not what you're thinking, definitely not those.
1. The required level of paranoia. (Extreme)
2. Detailed knowledge of what he's doing.
Note the order.
The database is the equivalent of all the paper records the business might have had otherwise. Operating on it is the equivalent of opening all the filing cabinet drawers and peering into them with a lighted candle in one hand and a jug of petrol in the other.
That's literally what the transaction log is, and it can be stored over the network rather than locally. Sure, some software keeps an immutable ledger at the application level, but those have no protection at the console level; a transaction log is still the only true immutable history of state at any point at that level.
Prune a transaction log without a tested restorable backup at your extreme peril.
It's an interesting concept. What do you do in the event of a legal requirement to remove some data, for instance a data subject right to be forgotten request?
It's the potential of destruction of media, or even the entire H/W that needs to be dealt with. Before I moved into IT I'd had the experience of my workplace being bombed (fortunately not very effectively) and burned (rather more effectively) so my subsequent thinking was more in terms of getting regular backups into a fire safe and preferably off-site.
There's a few approaches.
First is make sure that your transaction logs include the removal - that way, any restoration done using the logs will comfortably re-delete the data.
Second depends a lot on the business justification clauses of the relevant laws. Like, "will the Tax Office expect these records to be available" is a question my lot ask themselves, and if the answer is no then you can prune the transactions. If it's yes, then the logs will have to stay for the usual period (varies by what you're doing, but IME it's rare to have any valid use for identifiable customer data older than 7 years or so).
I've talked with a friend whose shop encrypts all the customer-identifiable stuff, with a master key table used to decrypt each customer individually. To delete a customer, you just kill their line in the key table, and bam, every single bit of information that's PII is inaccessible to you. This means you can keep policy numbers or invoice dates/amounts to make your accounts add up, and just lose the customer's name/address/DoB/etc. You do pretty much need to build everything around it, mind; it's not easy to retrofit.
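That pattern is sometimes called crypto-shredding. A rough Python sketch of the idea (everything here - the key table, the names, the values - is illustrative, and it leans on the cryptography package's Fernet):

from cryptography.fernet import Fernet

# Per-customer key table: customer_id -> key. In real life this lives in
# its own, tightly controlled store, separate from the main database.
key_table = {}

def store_pii(customer_id, pii):
    key = key_table.setdefault(customer_id, Fernet.generate_key())
    return Fernet(key).encrypt(pii)     # ciphertext can sit anywhere, backups included

def read_pii(customer_id, ciphertext):
    return Fernet(key_table[customer_id]).decrypt(ciphertext)

def forget_customer(customer_id):
    # The "right to be forgotten" step: once the key is gone, every copy
    # of the ciphertext - live, archived, backed up - is unreadable noise.
    key_table.pop(customer_id, None)

blob = store_pii(42, b"Jane Doe, 1 High Street, 1970-01-01")
print(read_pii(42, blob))
forget_customer(42)
# read_pii(42, blob) would now raise KeyError - the PII is gone for good.

Non-PII columns (invoice amounts, policy numbers) stay in the clear, so the accounts still add up.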
I think there is an issue with having the removal in the transaction log, since either the data still exists in the transaction log (in which case it has not truly been deleted) or the data is truly destroyed, in which case we are right back where we started with a mutable history.
There will always be friction between the goal of "make it impossible to accidentally permanently delete things" and the legal requirement of "permanently delete certain things".
Yeah, approach 1 is debatable. The way the law is written you can retain business-necessary info, and right now there's not that much case law. It's possible the ICO would accept "we retained a restore in case of subpoenas, but the data is unavailable to normal processing", it's possible they might not. Likely depends on the nature of the business.
See my comment above on your first point.
Of course destruction of media is not an issue that can be solved in the database itself (well OK, there is stuff like ECC, but that's not a cure-all). I see it more as, let's separate the concerns of database interaction fuckups and hardware fuckups. Defence in depth and all that.
"1. The required level of paranoia. (Extreme)"
Indeed.
My home backups take place ad hoc, since the backup server doesn't run 24/7. I power up the latter and run the script when I think of it.
One time I set out to transplant my laptop's drive into a new laptop. Take the several hours to back it up first? "It's a simple thing I'm doing. What could go wrong?" But, being the impatient but paranoid sort I am, "OK, _fine_," I sulked at myself. "I'll do the right thing: run a backup overnight, and move the drive in the morning."
What could go wrong? Post-move, the drive failed to spin up. Nor when reinstalled in the old laptop. Dead, kaput, pining for the fjords, yada yada.
No new lesson was learned, but suffice to say, one that had previously been learned the hard way was powerfully reinforced.
Side note: introductions to computers often talk about the CPU being a computer's brain. Whatever. The HDD (or latterly, SSD) is its soul.
I'm retired. This year, I have successfully got all of our IT down to two iPhones, two iPads and one iMac with some stuff in the cloud. The only content stored locally on the iPhones and iPads has been downloaded from the cloud. The iMac has three 16-day rotated Time Machine backups (one in the firesafe), two monthly rotated Carbon Copy Clones (one off-site), two monthly rotated file copies of the cloud (one in the firesafe), and one disk with everything copied annually off-site. I replace 1-2 disks a year. Total cost about £1,000.00. Paranoid - Me? No, I've just been using IT professionally for 51 years...
"My home backups take place ad hoc, since the backup server doesn't run 24/7. I power up the latter and run the script when I think of it."
I recommend a Pi, a large USB drive and NextCloud with the NC client software set up to sync all the directories where you might put things. The server has an area shared with SWMBO so that when I put together her class notes PDFs she can see them to email out to the class.
I still recommend backing up your /home partition or whatever other location your OS keeps your data on before doing anything drastic.
I recall, many years ago, taking over IT admin for a small organisation that ran everyone off a single server. When the original IT manager left, the General Manager passed IT admin over to the office manager. She was sent on a three-day course and, on her return, followed advice given and updated all the system passwords. Backup tapes were changed daily and weekly and stored in a fireproof safe - but never actually tested (neither did she look at the backup logs). Following an IT issue she couldn't solve I was handed the IT role (part time as I was actually employed in a non-IT role). After fixing her problem (it must have been straightforward, as I don't recall what it was) I started to delve into system logs and quickly spotted errors reported for backups. Further delving and it transpired she'd not given the backup program the updated passwords and all the backup tapes for the six months she'd been in the IT role were substantially blank. At least I wasn't going to need to buy any new tapes for a while, once I got the backups running properly...
We keep coming across these things.
Backup consisted of an overnight copy to the hot standby at the other end of the site. I don't remember the details - maybe it was a change of permission that allowed/disallowed read access to the backup UID - but the backup would be terminated before the morning shift started. For a very long time the overnight slot hadn't been long enough for a complete copy and nobody had checked...
Fortunately this was belt and braces - there was also a tape copy but I'm not sure the tape formats were compatible between the two machines.
Anon as I work for the company still and so do the other people.
There was an issue with an application running on a database where it hadn't imported data for some days. Our overseas team that wrote the software gave a bunch of fix steps to run through, the first of which involved "drop"; the rest of it "worked". The person running the fix steps was not a DBA and did not know what the commands meant - they were just told it would fix it.
Cue the customer saying, "there's no data".
Cue the restore and then replaying multiple days worth of data that took a couple of days to complete.
DBA rights promptly removed that day to everybody that was not in fact an actual DBA.
Also, any "run this list of commands" steps issued by another team were from then on reviewed by multiple people to determine what they would actually do.
Fortunately, I did not get sucked into that maelstrom, just managed to snigger from afar.
One of our customers has a database which stores, well, basically everything, including all the sales for both their website and physical stores.
It has one, single, user, which has full rights to everything. The PoS software uses this admin to write its sales into the DB, as does their web store, and everything else around the company.
Fuck knows if it's even being backed up anywhere.
If it's anything like where I work, yes, it is being backed up. Into a backup which has never been tested, while the backup account is an AD Domain Admin.
I was actually shocked to find that the DB service didn't use the same cred. No, the "SQLService" account is running ALL the DBs in the entire org, multiple DB farms, scores of servers and applications.
Yes, I have got copies of the emails where I've pointed this out at length to managers and security team, multiple times.
"they can pretty much disallow almost anything in their DBs"
Quite. Stephen failed to backup the DB probably because he was not logged in as a backup operator.
"Surely I'm smarter than this issue," Stephen thought to himself.... Nope. You don't know what you don't know. Once the software has demonstrated your ignorance to you, it's time to step back and RTFM (or consult the greybeard, call the vendor, etc.). When I was a callow youth I (oh so delicately) brushed the horse fence with the backhoe. The owner was understandably angry, but gave me a valuable piece of advice: "When you are unfamiliar with the equipment, GO SLOW!"
Ahhh.
Like the time I received a call from a customer telling me our software wouldn't talk to their database. The conversation went like this:
Me: Can you see if the database is running.
Customer: It seems to be.
Me: Hmm. OK. What processes can you see.
Customer: a couple of ksh processes.
Me: That's not your database. Can you try and start it.
Customer: I don't know how. I don't usually do this job.
Me: OK - cd into this directory....
Customer: I can't - it's not there.
Me: Huh? Where is it??
Customer: Well, we had run out of space so I deleted some files....
Yep - the customer had deleted the entire database. Not only that, they hadn't had a successful backup for several weeks!
"Well, we had run out of space so I deleted some files"
That is sometimes (well, often in my experience) caused by crappy application software. It does:
errorhandler: if err=cantsave print "The disk is full - delete something"
instead of
errorhandler: reporterror; if err=cantsave print "Couldn't save"
I had this where the underlying error was "user account has run out of allocated space", *NOT* "disk full". The solution was to credit the user with more space, not to go trawling the disk desperately deleting things, and wondering why the damn thing STILL wouldn't work with a disk 99.99% empty.
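In other words, surface what the OS actually said instead of guessing. A trivial Python-flavoured sketch of the difference (names are made up):

def save(path, data):
    try:
        with open(path, "w") as f:
            f.write(data)
    except OSError as e:
        # Bad:    print("The disk is full - delete something")   <- guessing
        # Better: report what the OS said. "Disk quota exceeded" and
        # "No space left on device" are different problems with different fixes.
        print(f"Couldn't save {path}: {e.strerror}")
        raise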
Used to work for a company (a long time ago) that religiously pigeon-holed roles and responsibilities. Even down to scripts that did something as simple as a backup!
So I wrote a script that backed up data in the following way.
- Write a header describing the backup contents
- Write my application files
- Call a DBA written script to back up the database
- Write a trailer file to show the backup was complete
Worked great for months and months. Then one day we had a problem that required a restore. Off I trotted to the tape store, put in the tape, read the content expecting to see a header, my stuff, a DB dump, and a trailer file - only all I could see was the trailer file. Odd! So I go back 3 MONTHS - all the same problem. So I am confused. I changed nothing. I verified it all worked fine during testing and validated for weeks after it went live. I was in for getting a bollocking....
I did a bit of investigation and finally found that one of the DBAs had made an unauthorized change to their script, and made the tape device a rewind device instead of a no-rewind device. After his DB backup, it rewound the tape to the beginning just for me to write a trailer file!
I worked at a PC shop where we had a policy of having people sign a waiver stating they'd made a full backup of their system before we'd work on it, or, if they hadn't, we'd make one for them at standard shop rates.
One day I found myself in a race with a plug-your-ears, shrieking-bearings, dying Miniscribe 3650 hard disc to complete the backup before it froze. I think I won -- the restore to a new hard drive completed, a CHKDSK showed no file system inconsistencies, and the customer didn't call back complaining about missing or corrupted files ...
(Icon for excessive noise ...)
Not specifically DB related, but when I was on tech support for Dixons/PC World virtually every computer for a while*** was supplied with an unimaged recovery sector and disk-imaging utility, and no Windows disk. It was a cost-cutting exercise so retail prices could be kept down (or profits kept up, depending on your level of cynicism).
The first time each machine was powered up it asked you to create the recovery image, which in most cases was burned to CD/DVD. However, you could cancel it. And most did.
If I remember, it was time-sensitive, and you could only create the image for so long after initially registering Windows.
I can count on the fingers of one hand (exaggerating) the number of people who'd actually done it, and the ones who had often only did so after they'd called in early on and I warned them to do it pronto (and stick the disk in a sleeve, then in a box and a safe place, but not in the cutlery drawer, because it was rather important). Most didn't.
And don't get me started on the reliability of the imaging process when run, or of the reliability of the alleged image when used later in many cases. We had to send out replacement master disks regularly.
*** And in a different phase of machines, the recovery sector simply remained on the hard drive, which was fine unless the disk failed. Worth noting that one of the main parts replaced by us was... hard disks.
We had one dept in a dark corner of the building that ran their own little obscure system and ran their own backups to a tape machine. When called on to help with a problem I was rather surprised to see the absolutely drop dead gorgeous girl in charge of backups pop the tape out and put it back in again. Turned out they'd only got one tape, and when the system got big enough to need two tapes they just popped the tape out and put it back in again when the machine asked for Tape 2. Not surprisingly the problem was solved, and due to the application of the absolutely drop dead gorgeous girl in charge of backups protocol the whole thing was solved with immense dedication and thoroughness. Indeed the protocol even now slows my typing to a crawl, as no PFY or even BOFH wanted to leave her presence faster than necessary or even consider incurring her displeasure. In her 'defence' she was actually very good at her job bar the backups, which were probably handed over to her to prevent PFYs getting lost or malfunctioning for the rest of the week.
For a while it was common for the RDBMS to own the file system; some even had their own file systems… While that didn't make backing things up any easier, it also made deleting stuff harder.
I've hosed a Postgres DB in my time (always able to restore the data) and was surprised at how well I was able to restore from a disk backup.
But what got me about the story is that the update instructions included a DROP DATABASE. That's a mistake in any language. At the minimum there should be backup and restore steps. Don't touch it if there aren't! But, also, where was the test? No production system should be changed until everything has been shown to work on the test. If manglement won't approve this then it's time to get out because you will get the blame when it inevitably does go wrong.
I remember sometime ago a co-worker learning that deleting records from an in-memory array meant it could also delete them from the database, depending on the parameters used for creating the damn thing.
Not a fun weekend for him, recovering information from backups and transaction logs but at least he learned, as did others (by example, natch), that everything should be thoroughly tested before deploying in production, even seemingly small changes.
One of our clients had an MSDOS-based computer, with a 40MB streaming tape unit. We showed the PC user, an accountant, how to make backups, and how to test them. He nodded happily, and thanked us for our efforts.
Three years later, their offices were broken into, and that PC was physically stolen. Turns out, he'd made a grand total of ZERO backups. They fired him, but that didn't fix their problems.
(icon for "being shown the door")
Being a careful chap, he also took a copy of the directory of the application from the Programs Folder. Just to make sure he had a copy of the data.
When I first read that I assumed he knew for damned sure the file(s) he was backing up were the file(s) holding the data.
Seems not. You'd think he'd know by the size, if nothing else.
Software that I don't touch save when there's an issue, needs rebooting, etc
I ran our entire company ERP system off a Postgres database, which never needed rebooting in all the years it was in use under my control. I was a bit perplexed until the mention of the 'Programs Folder' when it twigged that it must have been running on a Windows machine.
In my experience Postgres is a fantastic bit of software that 'just works', especially if running on top of a stable Linux distro. Backups were automated and restored daily as an almost live 'play' system, allowing authorised people to try things out to see if it was going to work as they might hope. After I retired, the new BOFH also kept the transactional data during the day, allowing much finer backups if required (they weren't), and I understand he's now virtualised everything too, without a hiccup.
I'm no DBA, but have a long chequered past involving Dbase III programming, Access & MySql. I can RTFM. There is simply no excuse for not reading the manuals and working out the basics before putting any valuable data in there. Cost is no excuse either, as we even used a Raspberry Pi to host a copy on Postgres while working from home.
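For anyone wanting to copy that 'play system' idea, a rough nightly-refresh sketch in Python driving the standard Postgres tools (the database names and dump path are invented, and it assumes the client tools can authenticate without prompting):

import subprocess
from datetime import date

DUMP = f"/backups/erp-{date.today()}.dump"

def run(*cmd):
    subprocess.run(cmd, check=True)

# Dump the live database in custom format...
run("pg_dump", "--format=custom", "--file", DUMP, "erp_live")

# ...then rebuild the scratch 'play' database from that same dump,
# which doubles as a nightly restore test.
run("dropdb", "--if-exists", "erp_play")
run("createdb", "erp_play")
run("pg_restore", "--dbname=erp_play", DUMP)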
"kept the transactional data during the day allowing much finer backups if required (they weren't)"
But this is why we do backups including the transactional logs. You do them in case they're required for production. A restore to a test system is a bonus. You always hope they never will be required, but it's knowing you can do a restore up to the last checkpoint or, preferably, up to the last commit that lets you sleep at night.
It wasn't a criticism at all, and I was happy to see him get round to doing stuff I'd never had time to do. (For the record, I was running the company and tended to the IT side in my 'spare' time). Like many small companies we never had the spare cash for a dedicated BOFH - until the rise of Win10, and I decided that I'd had enough.
Backups are like most insurance policies: essential, but you hope you never have to use them. And bad backups are like cheap insurance ...
I was once doing support for a CMS on the side.
Said CMS was never designed for large numbers of users and got seriously bogged down at one occasion. I found that the performance issues prompted many users to create multiple accounts (they never received a confirmation, but user records were still created), and that further slowed the system down.
So, I had the bright idea to clear out those half-broken accounts until things would ease up a little.
Opened up the SQL query tool on the server and prepared a DELETE command.
Sent the command off and then it dawned on me that I had forgotten the WHERE clause...
Imagine my relief when the query tool refused to execute the command because it was missing the semicolon at the end.
"Opened up the SQL query tool on the server and prepared a DELETE command.
Sent the command off and then it dawned on me that I had forgotten the WHERE clause..."
Which is why I always do a select blah to check the data that will be affected before doing the delete.
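Something like this - a minimal sqlite3-flavoured sketch of the look-before-you-leap pattern (database, table and column names all made up), with the bonus that the WHERE clause is written once so the SELECT and the DELETE can't diverge:

import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE users (name TEXT, confirmed INTEGER, login_count INTEGER)")
cur.execute("INSERT INTO users VALUES ('ghost', 0, 0), ('real', 1, 5)")

# A fixed string, not user input, so building the SQL this way is fine here.
where = "confirmed = 0 AND login_count = 0"

# Look before you leap: same clause, read-only.
cur.execute(f"SELECT COUNT(*) FROM users WHERE {where}")
print("about to delete", cur.fetchone()[0], "rows")   # -> 1

# Only then run the destructive statement, inside a transaction that can
# still be rolled back if the count looks wrong.
try:
    cur.execute(f"DELETE FROM users WHERE {where}")
    con.commit()
except Exception:
    con.rollback()
    raise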
What self-respecting male asks for instructions!? That's grounds for instant termination of one's man-card!
But I was peripherally involved in a similar situation way back when I was still in school. By virtue of being the offspring of someone involved, I wound up as tech support for a small non-profit (literally two part-time people and me on an as-needed basis). The government program they received funding from required that all the information be entered into this custom DOS-based DB app. It was a real turd, as you might imagine from something created by a government organization that didn't specialize in software. IIRC, the entire thing ran off a single floppy drive. Being in my early teens, the idea of a backup didn't even occur to me, and this was still very much the era of the 3.5in floppy. The exact series of events eludes me now, but one day I get a call because there's a message on the screen congratulating the user on creating a brand new database, and wanting to know if there was anything that could be done to restore the old data. Even if the raw data was still sitting around somewhere, the software designers apparently never considered the possibility that people might want to load in a database from a file, so many hours of tedious manual entry ultimately had to be redone.
Even if he wasn't a DBA by trade, the fact the backup failed when tried manually should have raised red flags a mile high. He clearly wasn't "smarter than the issue" and mea culpa or not, he should have been out on his arse for gross negligence after blindly continuing with instructions he thought he vaguely remembered
'drop' and 'backup' brings back memories of one of the programmers doing a recreation of the 'Odessa Steps' sequence, only instead of a pram it's one of the laptops going down the stairs from the conference room.
Bang
Bump
Bang
Bump
Crash
Of course we have backups.... new laptop duly found, CAD software installed, just link it to the server and away we go ....
"What do you mean? you cant find your models or files? where were they?...... SAVED ON THE DESKTOP OF YOUR OLD LAPTOP???!!!!!!!! FFS you are kidding.. what about the directory on the server where everyone else saves stuff?"
Cue retrieving dead laptop from skip.. and pulling the HDD for copying
The very first time I logged into a production Oracle database - was to restore it for a client I had never worked for. It seemed that Oracle support had sent their recovery consultant and he said they would lose 2 years of data and there was nothing that they could do. So the client called me and asked for help. I asked different questions and got a full restore and recovery.
A year later, I joined Oracle and was at an awards ceremony where that consultant was awarded "fireman of the year".
Since then, I have recovered from many other DBAs' mistakes when they didn't have proper backups. But with Oracle, as long as you have archive logs, you can probably restore. I have never failed to restore a production database.
That is because I practice restores all the time as part of routine operations. Because, if you haven't practiced restores recently, you don't have backups.
Perhaps fifteen years ago, I got a call from a co-worker:
Co-worker: I tried to query IMPORTANT_TABLE and the system says it doesn't exist.
Me: [After a quick check] It doesn't.
Co-worker: ???
Me: [After another quick check] As a matter of fact, you have about three tables left in your schema.
This was Oracle 8 or 9, though, and it was quick to get them back from the recycle bin.
Well before that, some co-workers found that ANOTHER_IMPORTANT_TABLE in a different database kept disappearing. I don't know how that happened, but I wrote them a DDL trigger that would raise an error if one tried to drop certain tables.
I had a similar issue (IMPORTANT_TABLE gone AWOL) happen once, and was *remarkably* glad that it was actually some arcane failure in our TNS.
Sure it took a week for tech support to un-pluck that turkey, but the table still existed for everyone else, I just couldn't see it from my machine.
1) Upgraded a customer to the latest version of MSDOS. Did a backup, installed the new version, did a restore. Only I didn't do a restore because the format of the backups wasn't compatible between the two versions of MSDOS
2) Customer running two businesses on 2 MSDOS computers using the same software. Unfortunately he was using the same set of floppies for each, so the only backup he had was the No2 machine's, as that backup had overwritten the No1 machine's backup. Luckily he didn't have to restore the No1 machine before I discovered what he was doing. Why did he only use one set of disks? Because he was a tight-fisted accountant.
I knew a VMware admin whose backups were VM disk snapshots. Luckily I was a mistrustful fellow and had, 2 months prior, requested the vCenter metadata export and an offsite+offline full disk dump of the data stores.
Since I used to do daily test reports I had all the infra details to the hour mark... Come the inevitable VMware patch and subsequent data disk crash... we spent nearly 5 weeks restoring config and infra with 2-month-old data. For the newer stuff we copped to the data loss.
Moved on afterwards, and the site ran unattended for 2 years thanks to the good rotation, cleanup and auto-recovery scripts we developed and configured.