To avoid disaster-recovery disasters, learn from Reg readers' experiences

How can you avoid a disaster recovery disaster? You can find answers in the pages of The Register, specifically our reader-contributed tales of tech support triumph and terror: On-Call and Who, Me? We've …

  1. Eclectic Man Silver badge
    Facepalm

    'reviewed ... by hand'?

    We've carefully reviewed both columns, by hand, plus perused unused submissions in both columns’ inboxes

    Seriously, haven't you heard of AI? Just feed the articles into ChatGPT and ask for a summary.*

    But seriously, I was once consulting for a UK Government Agency whose HQ was located in Bristol. Their plan for their essential system was that in the event of a disaster in Bristol, they would switch to their backup site in London and be running within 2 hours. I pointed out that unless they had a Harrier jump jet waiting on the roof** for each member of staff they needed to transfer from Bristol to London, they would never make the trip in 2 hours, let alone get there in time to load up the DR system and get it running in that time.

    Moral of the story - it is not just the IT that counts.

    *(Then you'll be really f**ked.)

    ** OK, I might not have said the bit about the Harrier out loud.

    1. Jou (Mxyzptlk) Silver badge

      Re: 'reviewed ... by hand'?

      According to Google Maps: two train connections, both in less than 1h40m. However, not knowing the exact location within those two villages, a few minutes of extra walking are to be expected.

      1. Richard 12 Silver badge

        Re: 'reviewed ... by hand'?

        Those trains are roughly every half hour, so that's a 15 minute average wait. Except at night when they aren't running at all.

        After arriving at a London train station, it's generally 30-60 minutes to actually reach most London offices (unless the office is right outside the correct train station).

        Aside from that, they were likely assuming 2 hours of working time to bring it up, not 20 minutes.

        I've noticed that very few managers seem to understand that travel time exists.

        1. Eclectic Man Silver badge

          Re: 'reviewed ... by hand'?

          And what if the disaster from which the client is recovering is sufficiently serious to have affected the rail service adversely too?

          BTW, their office was about 5 miles from the railway station in Bristol; congestion on the roads could easily have made the journey there by car impossible.

          1. Anonymous Coward
            Anonymous Coward

            Re: 'reviewed ... by hand'?

            Coaches. The DR centre I worked at had ample coach parking space and we had coach companies on speed dial, so we could fill our floors with up to 600 of your staff on a first-invoked, first-seated basis.

            1. CrazyOldCatMan Silver badge

              Re: 'reviewed ... by hand'?

              At one point, we hired a DR/BC service from HP where they would supply a lorry, already loaded with hardware like ours and all cabled up so that we would just need to provide power, backup tapes and a network attachment.

              Eventually, we decided to pay for a test of the setup. It failed, miserably. The hardware wasn't remotely like what we had, the SAN was different so we couldn't import the config from the old one and the network connection was incredibly slow, even though it theoretically was identical to our live connection.

              Given the eye-watering cost of the service, we asked them to fix the issues and re-test.

              They wanted to charge us for the changes. We pointed out that not only had the test failed, but they'd also directly breached the terms of the contract.

              We cancelled the contract the next month.

        2. breakfast Silver badge

          Re: 'reviewed ... by hand'?

          And YET managers have the audacity to complain at me when I act as though the time I leave the house is the same as the time I'll arrive at the office, regardless of my commute time.

        3. Jou (Mxyzptlk) Silver badge

          Re: 'reviewed ... by hand'?

          Thank you for the real-world follow-up! I, being on the other side of the Channel, would have guessed, by the size of both cities: "30 extra minutes Bristol, 60 extra minutes London". But I hate guessing. So I used the "a few minutes" clause.

    2. Anonymous Coward
      Anonymous Coward

      Re: 'reviewed ... by hand'?

      When, and where, I was working in Bristol 50 years ago, having a Harrier ready wouldn't have been too much of a challenge as there were quite often one or two around. OTOH, having our own airfield meant we could also use more conventional aircraft...

      1. Anonymous Coward
        Anonymous Coward

        Re: 'reviewed ... by hand'?

        I once met a guy who worked there, his office was triple-glazed to keep the engine noise out. He said it wasn't unusual to look up from his desk, and get a wave from the pilot of the Harrier that was hovering just outside the window.

        1. Jou (Mxyzptlk) Silver badge

          Re: 'reviewed ... by hand'?

          Sounds like my dream office...

    3. Albert Coates
      Mushroom

      Re: 'reviewed ... by hand'?

      Brizzle to Shit City in 2 hours? Just about doable; it's only 120 miles. It's relatively unlikely that many people have driven down the M4 at around 145 mph in a well-sorted Renault 16-valve Chamade at 3 am, but I did once. It was fairly scary: the front end was feeling pretty light, like it was threatening to do a Donald Campbell in Bluebird. Just saying.

    4. ColinPa Silver badge

      Re: 'reviewed ... by hand'?

      After an incident, one customer had his team go to the new DR site - but their badges were not on the system, so they could not access the building or the operations room. The security guard would not let them in (a good thing) until the problem went up the management chain and down again, which wasted about 15 minutes.

      1. Anonymous Coward
        Anonymous Coward

        Re: 'reviewed ... by hand'?

        Ah, the NHS, that reminds me of one. Late on a Friday, we were asked to do a job because all the perm engineers had fucked off. Asked to do it with no ticket because the lady who asked us was the daughter of the head of IT, so it appears she's allowed to do whatever the fuck she wants. We get to the other building and our cards don't work. We call her; she has to walk all the way over and down the hill to find her card doesn't work either.

        We catch a passing nurse who lets us in after we explain why we're there. We find another two nurses where we pick up the kit, only to be told the engineers were there earlier picking up some but not all of the kit. It appears that because it was Friday they were being lazy cunts and only took the keyboard and mouse, and thought they'd leave the rest for the temps.

        We load up the trolley of kit and push it up the steep hill. We get to the department it's for and no one knows anything about it. FFS. Eventually someone who does know turns up and we have to carry it up winding stairs.

        And people wonder why the NHS is fucked.

      2. Anonymous Coward
        Anonymous Coward

        Re: 'reviewed ... by hand'?

        Shortly after the COVID lockdowns I was chatting to someone I knew vaguely who worked for one of the big banks... He observed that, as they'd just survived without major issues for several months with all their staff working from home, maybe they'd overpaid 5 years earlier when they'd spent millions on a dedicated DR site ready just in case their main office became unusable.

    5. Anonymous Coward
      Anonymous Coward

      Re: 'reviewed ... by hand'?

      Had a local council decide to close one of their other main offices, only to reopen it shortly after because of outrage from residents. They were going to sell the building. It was pointed out to them, which the execs didn't seem to even fucking know, that it was also their disaster recovery site, and that they were also close to shutting off its broadband. The broadband that is used for DR.

    6. CrazyOldCatMan Silver badge

      Re: 'reviewed ... by hand'?

      I used to work at an airline reservation CRS based in Wiltshire (the one that had a name starting with 'G').

      The stated goal was that we could act as a DR centre for our US partner, so we had up-to-date copies of their code and data snapshots. So, in the event of a 747 landing in the wrong place (their data centre was/is [1] next to one of the runways at Denver airport), we could at least run a basic service.

      Then they decided to close the European site (us - although I'd left by then, my wife still worked there) [2] and offered relocation to Denver, fully expecting a majority of the dev and ops staff to take them up on their offer. They got a lot less than 10%.

      Said event happened after we got a new CEO - a US guy who just happened to be from the US partner.

      One of their main competitors then opened a dev centre in the town and a *lot* of the devs joined them.

      The irony is that they then opened a 2nd data centre as DR - right next to the main one. So any disaster affecting one would almost certainly affect the other too. How to *not* do BC/DR.

      [1] Dunno whether they still exist and can't be bothered to check.

      [2] She got good redundancy pay so we basically had all one summer off (I was EKS so had plenty of money in the company account).

      1. steviebuk Silver badge

        Re: 'reviewed ... by hand'?

        Tell us who they are so we can look :)

  2. Boris the Cockroach Silver badge
    FAIL

    And theres

    the oft-hilarious tale of the guy who made a disaster recovery partition on his single hard drive to save his valuable data in case of failure... yeah, we know where this one is going... failing to account for his HDD failing...

    "Hello Boris... can you have a look at my computer......"

    Where's the "head banging into a wall" icon?

    1. Anonymous Coward
      Anonymous Coward

      Re: And theres

      Yup. Home user wanted help to get his laptop working. I get there and the laptop says it cannot find the boot drive, and it is making clicking noises (aka the click of death).

      Sorry Jim, it's dead. Unless you want to spend $400+ to send the drive off for data recovery.

      1. MiguelC Silver badge

        Re: And theres

        While in uni I recovered a dead hard drive by buying the exact same model (and version and revision) and moving the actual disc from the dead drive to the good one. All that done in my room (it was clean, 'guv).

        In those days, with single-platter drives, it was risky but feasible. Now, with multiple platters and helium-filled, sealed drives, good luck with that...

        1. Paul Crawford Silver badge

          Re: And theres

          Did that once in the early 2000s; the next time, my desktop machine had RAID...

          I did have most data backed up, but not very recently, and I wanted to save the trouble of reinstalling Windows and all of the related software & licensing. Converting the machine to a VM (with the seemingly long-lost physical-to-virtual converter from VMware?) and running it on a RAID host was a great relief!

        2. CrazyOldCatMan Silver badge

          Re: And theres

          While in uni I recovered a dead hard-drive by buying an exact same model (and version and revision) and moving the actual disc from the dead drive to the good one. All that done in my room (it was clean, 'guv)

          Several jobs ago, we moved into a shiny new office and I was responsible for moving all the Sun kit, including a very elderly SparcArray (it hadn't been turned off in years... it held the various user home directories). In prep, I arranged for a spare SparcArray chassis to be sent to me and got a small stock of replacement drives.

          The commercial movers delivered the array without any extra dents or drop-marks so, after hooking up the various Sun boxen and making sure that they worked, I connected the array and powered it up.

          About 50% of the (fairly small) drives failed to power up. Enough to kill the whole thing: it overwhelmed the data protection aspects of RAID 5.

          In a panic I phoned our Sun engineering contact to see if there was anything I could do.

          His advice? Take the failing drives out, knock them edge-on against a solid object and put them back in. Eventually (after a lot of "are you sure" and "it's not April 1st" questions) I tried it - after all, I wouldn't be losing anything by trying.

          Enough of them spun back up that the RAID array could re-establish itself and I could replace the really-failed drives with some of the spares I'd got. Some of those also failed, but I had enough to get things up and working.

          And I promptly wrote a business case for replacing the whole damn thing with something more modern (and higher capacity) - which I got 3 months later (can't remember what it was, probably a late-generation Sparc Storage Array).

        3. Conundrum1885

          Re: And theres

          I remember recovering a laptop drive that got hit by lightning, which not only fried every USB port but also the hard drive's controller and the screen fuse as well.

          Yes, I had to not only transplant the PCB, back when that still worked, but also swap the heads. Recovery was a success and, unbelievably, the machine actually booted and ran with a Frankendrive (tm), though I did advise them that they should really REALLY back it up somewhere safe. Nicknamed that machine 'Nearly Headless NIC' and inscribed this on the toe-tag.

          Alas, my USB stick zapped with many thousands of volts of nasty snappy blue lightning didn't work ever again; no saving that, unfortunately.

        4. SCP

          Re: And theres

          I remember [years back] having one of those HDDs that had a notorious firmware bug in which some internal counter wrapping round would brick the drive. There was a fix that involved dismantling the drive and inserting some FTDI cables connected to a serial port on a working PC, then running up a terminal window and typing in various magic incantations. Ended up doing that twice to the same drive as it bricked itself again.

          I thought I had filed away the details somewhere [might come in useful one day :-)] - but can't seem to lay my hands on it. Another problem with data recovery - finding the backups.

          1. Jou (Mxyzptlk) Silver badge

            Re: And theres

            Today you would program an ESP32, directly soldered or plugged inside the drive, to do that every time it is powered on :D.

          2. SCP

            Re: And theres

            [Footnote: It was a Seagate 7200.11 (ST3750330AS) and I have found the firmware update (SD15) that eliminated the problem - saved with my other 'old' driver collections - but no sign of the fix (which was more interesting from an engineering hacking point of view). I hope I printed the notes and stored them with the cables 'just in case'. Ah, memories of the good old bad old days.]

            1. Jou (Mxyzptlk) Silver badge

              Re: And theres

              Oh THOSE! I remember those many bulletins for servers, which were sent to customers using those drives. A month or two later it was included in the "Server update routine" of the manufacturers. For Fujitsu and HP: I know, I saw it added, and I suspect the check is still in there today to update the HDD firmware automatically. Other server manufacturers: I don't know, but I bet the same.

    2. bombastic bob Silver badge
      Alert

      Re: And theres

      Murphy's Law applies with users' backup "solutions"

      Yet in the world of computers and IT, Murphy was an OPTIMIST

  3. Anonymous Coward
    Anonymous Coward

    One copy is no copy, two copies are half

    tl;dr: Data is never safe. Don't make fun of those poor sysadmins.

    I once had explained to me the design of a computer archive storing irreplaceable audio and video recordings: the last recordings of dead languages and vanished peoples.

    The primary system was a computer that was always running. Bits not on a live system rot away. Next to it, in the same room, was a copy of that computer. Both mirrored each other.

    There were two other such running twin systems, in other countries far away from each other. These other systems were live copies, mirrored.

    Parts of the archive were also stored on other continents. And the archivists were worrying about the long term readability of audio and video formats.

    The archivists had secured funding for 50 years for preserving the bit streams (just the bits). And I could sense some anxiety in the room about the future integrity of the data.

    Bits die when no one looks at them anymore.

    Archiving data, backing up, is hard, very hard.

    1. Anonymous Coward
      Anonymous Coward

      Re: One copy is no copy, two copies are half

      I thought 3 backup tapes were good enough (a long time ago).

      Then my Amiga PC erred during a backup. So now I had two good backups, maybe. Luckily it was a file in the disk backup routine I wrote that was cross-linked (or something) during the previous backup. So once I tracked that down and changed my code, all was good.

      That is, good after I bought several more tapes.

    2. wolfetone Silver badge

      Re: One copy is no copy, two copies are half

      "Bits die when no one looks at them anymore."

      You die 3 times.

      First, you die. Then you die again when people stop talking about you.

      Then you die a 3rd time when the only record of your existence doesn't get looked at on a dusty old hard drive.

      1. John Brown (no body) Silver badge

        Re: One copy is no copy, two copies are half

        And then Tony Robinson (or a descendant) comes along and digs you up for 45mins of TV "edutainment" :-)

  4. Doctor Syntax Silver badge

    "You cannot restore backups on-site if you can’t access your office."

    If you can't access your office not being able to restore them may well be the least of your problems. Recovering your organisation from a fire can be interesting.

    1. PCScreenOnly

      Bombs too

      It always surprised me how BT would take an absolute age to do anything. But with the Houndsditch bomb (the dump truck, not St Mary Axe or Liverpool St, which took out just glass) they weren't slow at all: they moved all our lines from our office around the corner to Cobham in Surrey within 2 hours.

      1. PCScreenOnly

        Re: Bombs too

        And after that, we left a "server" in the Cobham office. During a meeting, I decided I would go and do a restore from a previous tape. Only one problem: the toe rags in Cobham were using *our* server.

        I started to do random arrivals and test restores. One week I did three consecutive days. I was asked "what are you doing and why? You were only here yesterday or the day before". I simply replied "you never know when a disaster may happen".

      2. ricardian

        Re: Bombs too

        At a seaside resort in North Yorkshire in the 1980s, BT were advertising their remote fire monitoring package - a fire would trigger an alert and all would be well. Such a shame that a few weeks later a fire broke out in the local exchange during the wee small hours and, by the time anyone noticed, the place had burned to the ground.

        1. WolfFan

          Re: Bombs too

          It was, after all, a _remote_ package. The fire was in the exchange. It wasn’t remote, and therefore couldn’t activate the package. Union rules, y’know. Now, if the package had been installed next door, and the fire was over there, then it should have activated. Right?

    2. nick turner

      Earlier in my career I used to do a lot of DR planning work for companies. The usual answer to the "what would we do if a meteor hit the office" type of question would be to claim the insurance and retire to a nice tropical island.

      1. PCScreenOnly

        Same area of town

        I know one company whose main DR site was literally the other side of the Thames. They didn't seem to really grasp that some disasters can take out large areas for various reasons.

        1. smudge

          Re: Same area of town

          I once did a review of a company whose operational site was at St Katharine's Dock, at Tower Bridge, with all their stuff at a basement level below the level of the river.

          They were setting up a DR site, and despite me telling them that they couldn't rely on the Thames Barrier, and also pointing out that they were on the approach path to Heathrow - a plane crash could take out a large area - they set it up just a mile or so downstream :(

        2. Anonymous Coward
          Anonymous Coward

          Re: Same area of town

          In the winter of 2001-2002 a large US credit card company had a datacenter roof collapse due to the weight of snow. No problem normally, just switch to the backup site.

          Unfortunately their backup site had been in NYC, in the World Trade Center, which at that time was a smouldering pile of rubble. They hadn't yet completed the construction of the new backup site.

          Sometimes, shit just happens.

          1. Roland6 Silver badge

            Re: Same area of town

            That’s why you have tertiary backup. The only problem I encountered in one bank was that they had no recovery path back from the offshore tertiary data centre…

            1. Peter Gathercole Silver badge

              Re: Same area of town

              I worked at a UK bank that had a very good disaster recovery plan, and even tested important parts of it regularly to see whether they could bring up selected systems at the DR site (but did not actually run the service from there).

              The only problem was that they didn't have a plan to repatriate any of the services in the case that it was a temporary (but long enough to invoke DR) outage! In fact, once the services were running at the DR site, there was no plan to reconstruct or resynchronise the primary, or any tertiary site for further protection.

              It was a standing joke that if DR were to be invoked, almost the whole infrastructure team would move the services to DR and then hand in their notices, because they could not see how to return from DR.

        3. ColinPa Silver badge

          Re: Same area of town

          One US company I worked with was based in San Francisco. Their DR site was about 200 miles south. This was thought to be OK, until some bright young person pointed out that they were both on the San Andreas fault line. An earthquake would have taken out both data centres. It cost them $10 million to move it - but that was cheaper than the potential losses (and embarrassment) if they were hit by an earthquake.

  5. chivo243 Silver badge

    once upon a time at the pub

    A colleague and I were pondering data. All kinds of goofy stuff like: how much does data weigh? But one idea we had was to keep data safe by keeping it in motion; at the time the obvious solution was BitTorrent... and another pint.

    I remember the first time I saw the tapes fail; it was beyond horrible... I was new, and I got to see first-hand a full-on total disaster, with very little recovery. That event probably saved the org. Had they not had a major loss of data when it did happen, they may have continued growing without a proper SOP, and a disaster like this one would have crippled or killed the org later.

    Perhaps the process should be called Restore or Recovery, not Backup.

    1. Ian Johnston Silver badge

      Re: once upon a time at the pub

      But one idea we had was to keep data safe by keeping it in motion

      In the clacks overhead, for example.

      1. Flightmode

        Re: once upon a time at the pub

        GNU Sir Terry

    2. smudge
      Boffin

      Re: once upon a time at the pub

      But one idea we had was to keep data safe by keeping it in motion

      I've seen that done physically. Late 70s, before PCs and online storage for everyone, even us IT types used to store physical copies of things - manuals, brochures, reports, design documentation, code listings, etc etc.

      One guy I knew didn't have enough storage space, so he would get some large boxes, parcel up stuff that he thought he wouldn't need for a while, address them to himself, and then take them along to the "post out" location. A few days later they would come back to him.

      Note that this only works in large, impersonal organisations.

      He claimed to be inspired by mercury delay lines*, which really did store data in motion.

      *I would say "ask yer grandad", but I'm old enough to be yer grandad and they were way before my time....

      1. elkster88
        Windows

        Re: once upon a time at the pub

        "He claimed to be inspired by mercury delay lines*, which really did store data in motion."

        A number of years ago, I attended a talk at a Lockheed Martin building that used to be a Sperry Univac facility. The guests of honor were a handful of elderly gentlemen who were some of the first employees of Univac/(ERA?). They talked about the mercury delay line memories and how they created the first drum memory units by gluing magnetic tape to the drum.

        Some of the engineering solutions to the problems of early computers were pretty elegant, if bulky... Witness the cathode ray tube dynamic memory devices used in the Von Neumann computer at the Princeton Institute for Advanced Study.

    3. A_O_Rourke

      Re: once upon a time at the pub

      The weight of data, explained by Hannah Fry

      https://www.youtube.com/watch?v=bo389Jv9Zkw

    4. CrazyOldCatMan Silver badge

      Re: once upon a time at the pub

      I remember the first time I saw the tapes fail, it was beyond horrible..

      I loathe backup tapes with a passion. They can always pass a random restore test but, when you *actually* need to do a live restore, they always manage to give read errors.

      We've migrated off them entirely now to a live online backup service. I did point out that this has a major weakness, in that it requires a valid internet link, but got ignored.

      1. Peter Gathercole Silver badge

        Re: once upon a time at the pub

        This is why you use a storage management system which automates the copying of media. For example, a properly designed tape management system will have at least two on-site copies at any time, and at least one off-site copy, which will be either a physical tape movement or some remote copy.

        A system such as IBM Storage Protect (aka TSM) can manage all of this for you if it is set up correctly.
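
        Not IBM Storage Protect syntax, just a minimal sketch of the policy being described (at least two on-site copies plus one off-site copy per backup), checked against a made-up catalogue format:

        from collections import Counter

        # Invented catalogue format: (backup_id, copy_location) pairs.
        catalogue = [
            ("db-monday", "onsite-library-1"),
            ("db-monday", "onsite-library-2"),
            ("db-monday", "offsite-vault"),
            ("fs-monday", "onsite-library-1"),   # only one copy: should be flagged
        ]

        def check(catalogue, min_onsite=2, min_offsite=1):
            counts = {}
            for backup_id, location in catalogue:
                kind = "offsite" if location.startswith("offsite") else "onsite"
                counts.setdefault(backup_id, Counter())[kind] += 1
            for backup_id, c in counts.items():
                if c["onsite"] < min_onsite or c["offsite"] < min_offsite:
                    print(f"POLICY VIOLATION: {backup_id} has copies {dict(c)}")

        check(catalogue)   # flags fs-monday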

      2. chivo243 Silver badge
        Thumb Up

        Re: once upon a time at the pub

        Me three... I also got them to finally cut the tape. Backups to other sites, off site removable disk, local NAS.

    5. NXM

      Re: once upon a time at the pub

      "keeping data in motion"

      That's the mercury delay line! It sent serial data along a tube of mercury and the receiver at the far end recycled it back to the start again. Not much capacity for today's requirements though. I'm sure there was some setup involving high persistence CRTs as well.

  6. Ball boy Silver badge

    Always check what you're backing up

    One enterprising chap I knew (no, it wasn't me) went the full nine yards: installed a DAT drive on a dedicated SCSI card, with a big handful of tapes marked up for daily, weekly, monthly, quarter-end and all that good stuff. He copied a big chunk of live data to a temp volume so he could test backing it up, then safely deleted / changed the fileset before watching the perfect restoration of the data, just as he expected.

    Utterly confident his backup was working and that he knew exactly how to do partial or full restores, he carefully added the various backup intervals and then religiously changed tapes, keeping the long-retention versions in an off-site fire safe... until a RAID upgrade meant his temp volume got dropped and it suddenly became very obvious he'd never changed the backup's config to point to the live dataset. Lucky sod got away with it, didn't you, Dom? (I assume you still read El Reg.)

    1. Anonymous Coward
      Anonymous Coward

      Re: Always check what you're backing up

      "installed a DAT drive on a dedicated SCSI card"

      Ah yes, remember this vital point: *dedicated*!

      Back in the day, I made the mistake of neglecting this, having both disk and tape traffic go through the same card.

      It turned out the server soon died of data corruption, unrelated to any HW problem!

  7. trevorde Silver badge

    Old skool data recovery

    Worked at a patent attorney's in the days before computers. This industry wastes paper like you wouldn't believe. Even the small practice I was in had rooms full of filing cabinets, stuffed with correspondence. Anyway, one of our overseas associates had a fire in their building and lost *everything*. The only way to recover was for them to write to everyone and ask for copies of all communications ever. This generated even more paper. It took a few months but they mostly survived.

    1. Eclectic Man Silver badge

      Re: Old skool data recovery

      The only way to recover was for them to write to everyone

      How did they know who 'everyone' was and their addresses?

      1. Robert Carnegie Silver badge

        Re: Old skool data recovery

        Everyone. (?)

  8. GlenP Silver badge

    Backups...

    For a lot of years my standard for backups was, and to an extent still is, Daily, Weekly, Monthly, Yearly with appropriate retention periods for each of those (and where possible add in hourly log or differential backups). We had an ISO9000 auditor who demanded that we "Must use Grandfather - Father - Son backups!" In other words, he wanted me to go over to a three-tape system with far less resilience just to satisfy his ideas on what's sufficient and secure. Needless to say I continued as before.

    These days the cloud/on-site service deals with most of it anyway but our database backups follow the above pattern, with the Monthly and Yearly backups occasionally being used when someone asks for information as at a particular month or year end.
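
    Purely as an illustration of that sort of daily/weekly/monthly/yearly pruning (the filename scheme, the /backups location and the retention counts are all invented for the example), a minimal sketch:

    # Sketch: decide which dated backup files to keep under a
    # daily/weekly/monthly/yearly scheme. Assumes names like
    # "backup-2024-03-31.tar.gz" in /backups - both are assumptions.
    from datetime import datetime
    from pathlib import Path

    KEEP_DAILY, KEEP_WEEKLY, KEEP_MONTHLY, KEEP_YEARLY = 7, 5, 12, 7

    def backup_date(p: Path) -> datetime:
        return datetime.strptime(p.name[len("backup-"):len("backup-") + 10], "%Y-%m-%d")

    def to_keep(paths):
        paths = sorted(paths, key=backup_date, reverse=True)   # newest first
        keep = set(paths[:KEEP_DAILY])                         # most recent dailies
        for key_fn, limit in (
            (lambda d: (d.isocalendar()[0], d.isocalendar()[1]), KEEP_WEEKLY),   # one per week
            (lambda d: (d.year, d.month), KEEP_MONTHLY),                         # one per month
            (lambda d: d.year, KEEP_YEARLY),                                     # one per year
        ):
            seen = []
            for p in paths:
                bucket = key_fn(backup_date(p))
                if bucket not in seen:
                    seen.append(bucket)    # newest backup of each period is kept
                    keep.add(p)
                if len(seen) >= limit:
                    break
        return keep

    if __name__ == "__main__":
        backups = list(Path("/backups").glob("backup-*.tar.gz"))
        for p in sorted(set(backups) - to_keep(backups)):
            print("would prune:", p)       # dry run only; deletion is left out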

    1. Andrew Scott Bronze badge

      Re: Backups...

      Knew someone who paid for a cloud backup service. One day, when needing to restore data, she discovered the data stored in the cloud for her account was not available. The service claimed they didn't know what happened to her data. Not sure I'd trust cloud backup. It's also not available when your network connection is unavailable, like the time that wind storm brought the neighbor's tree down on your cable network connection and they take 2 weeks to repair it while claiming that it's been repaired every time you call them. Yep, love the service I get from Comcast.

    2. Robert Carnegie Silver badge

      Re: Backups...

      Tell 'em you've got grandfather, son, grandson, great-grandson?

  9. K555

    The first thing I panic about...

    I check backups are working. I check them at random, I check them on schedule, I get reports, I have live monitoring for replication jobs.

    If someone phones up and says 'I've lost a file' I STILL get a sinking feeling and think to myself "Shit! I hope the backup actually works!"
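
    For what it's worth, the "check them at random" part can be scripted. A minimal sketch (the paths and the "latest snapshot mirrors the live tree" layout are assumptions, and files changed since the snapshot will false-alarm):

    # Sketch: hash a random sample of live files against the latest backup
    # snapshot and exit non-zero on any mismatch, so cron/monitoring shouts.
    import hashlib, random, sys
    from pathlib import Path

    LIVE = Path("/srv/data")            # assumed live tree
    SNAPSHOT = Path("/backups/latest")  # assumed snapshot mirroring LIVE's layout

    def sha256(path: Path) -> str:
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    files = [p for p in LIVE.rglob("*") if p.is_file()]
    for live_file in random.sample(files, min(20, len(files))):
        backup_file = SNAPSHOT / live_file.relative_to(LIVE)
        if not backup_file.exists() or sha256(live_file) != sha256(backup_file):
            print(f"BACKUP SPOT-CHECK FAILED: {live_file}", file=sys.stderr)
            sys.exit(1)
    print("spot-check passed")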

    1. Andrew Scott Bronze badge

      Re: The first thing I panic about...

      Know that feeling. Back in the '90s I had the server in my office. The office had to be moved in one day, which meant taking the server down, taking it apart, moving it to the new office and putting it all back together. Did a backup to DAT. Moved everything, and it wouldn't start back up. The volume drive was offline. Found a spare and recovered from tape. At 1 am a dean walks in to do some work. Explained that the system was being restored and I didn't know if her data was recovered. Finished at 4 am, went home and was back at 8 am. A Molex power connector on the disk had a bad solder joint; repaired it a few days later. Scary. There's always a bit of dread when restarting a server, updating it, restoring files. You aren't paid enough when your working life and vacation life are spent waiting for the call that the server lost someone's data and it's your fault.

      1. John Brown (no body) Silver badge

        Re: The first thing I panic about...

        The biggest danger to kit powered on 24/7 is powering it off. Components happy to keep running often don't like that power "surge" of being switched on again, especially when they are more elderly :-)

    2. steve11235

      Re: The first thing I panic about... Yeah

      Our IT guys faithfully backed up all their Windows servers to tape. One day, they needed to recover. Oops, the tapes were old and unreadable. The president fired the IT manager and the guy in charge of backups over the VP's objections.

      1. WolfFan

        Re: The first thing I panic about... Yeah

        And that is why you generate a new tape set every so often. And why you have a spare system and restore to it every so often. Yes, it costs more. It also actually works.

        1. John Brown (no body) Silver badge

          Re: The first thing I panic about... Yeah

          Backups are not my problem, but one time I had to go to a client site to find out what was up with their local backup device. It turned out the tape was "expired" and was popping back out every night, 5 minutes after the designated person loaded it before leaving. They didn't see it pop out with a failure; they just saw it ready to be removed when they came in the next morning. Checked all the available tapes, and they had ALL expired. There was a distinct "oh shit" from the other end of the phone when I called their IT guys at their HQ. Not only were they not checking logs or having errors auto-emailed to them, but when they checked, it was all the other 40 or so remote sites too.

          From that, I just sort of assumed that all backup tapes had a life expectancy and decent s/w would either read a manufacturing date from the tape or at least read the date it was formatted and fail after some number of uses or age factor, making it difficult to have a tape so old it was failing.

    3. Terry 6 Silver badge

      Re: The first thing I panic about...

      I thought I was pretty good at keeping my home/family backups secure - something I'd carried on from how I did things in my working days. I do daily backups to a second internal HDD which is only for backups, every couple of days to an external USB HDD, which I swap regularly so that there is another copy elsewhere in the house, then monthly, or when I feel like it, to a third internal HDD that's for general scratch stuff, and I make ad hoc copies to some old retired HDDs in USB caddies and the like - one of which is always stored away from home.

      So I absolutely shouldn't have any missing files, let alone whole folders, right?

      So where the fuck did the entire folder of fun or useful images that I'd created in Photoshop(Elements) go to?

      Not a trace, not on any of the backups.

      It's a rhetorical question. I know what I must have done - I somehow omitted to include the folders in that partition (because it's a partition used for fun, trivial stuff that I'd thought didn't warrant backing up) in my backup list. i.e. I'd pointed the backup software to all my data partitions except that one, then subsequently decided to use it to store stuff that I had started making in Photoshop and did want to keep, but forgot that the data folder in that partition wasn't included in the backups. And then one day I must have deleted that folder instead of just some subfolders with unwanted old work-in-progress files that I didn't need any more.

      My conclusion: there is no 100% guaranteed method other than maybe constantly making backup copies, in several different locations, of everything that goes onto the computer. No exceptions, no matter how trivial the content may seem. And not ever deleting any of it. Which may prove to be impractical.

    4. Roland6 Silver badge

      Re: The first thing I panic about...

      In today's world you need to check the cloud services…

      QuickBooks, for example, only provides backup (and recovery) as part of its Enterprise subscription. For people on lower subscription tiers it’s a manual export of data, with no import capability…

      However, that might be incomplete. Last year, when they updated their Payroll package, users complained that some data fields, specifically Notes, were omitted from the migration and backups…

  10. K555

    User foot shooting

    We had a user that kept a lot of personal data on their laptop. They kept it outside of a backed-up area and were quite aware of this. But they were also confident it was fine because they'd purchased a USB HDD to make a copy to.

    Fair enough, really.

    Until their laptop died. Being enterprising enough to sort their own backup, they were also happy to have a go at fixing it. So they created a Windows recovery disc... using their backup USB drive.

  11. firu toddo

    It's ok we have a backup.....

    Of course you do.

    Yeah. We make two daily backups to DAT. And we change the tapes every day. And we have three sets of tapes. So that's 21 days.

    Do you check the tapes?

    Nah, they're digital.

    Ever restored from these tapes?

    Never! Don't need to.

    The backups never worked. If you don't check your backups you don't have any.

  12. mhoulden
    Mushroom

    Before doing anything critical, make sure you're using the right tape/disk/file/window/whatever. In 1971 they used the wrong codeword when they did a test of the US Emergency Broadcast System and put out an actual nuclear attack warning. It's on YouTube at https://www.youtube.com/watch?v=Yu4r79l8P8I. At least it interrupted a Partridge Family song.

    Another time an ISP was doing some work. They restored the last night's backup over the "live" one and lost 24 hours worth of customer emails. I was one of the people affected.

    1. JWLong Silver badge

      In 1971

      I missed this one; I was in US Navy bootcamp at the age of 17 because I had graduated H.S. and received my draft notice in August '70.

      USN, NEVER AGAIN!

  13. Anonymous Coward
    Anonymous Coward

    Many years ago, Corporate IT were 'invited' to take over responsibility for a system which had been built by an in-house skunk-works project, using hardware and software which were totally new to us. I reluctantly agreed, but with the proviso that the developers document the bare-metal recovery procedure and then prove it by reformatting the disk drives and restoring from backup. Needless to say, there was a crucial step missing and the backups were useless. I did get thanked by the team afterwards for teaching them an important lesson...

  14. the spectacularly refined chap

    Handwaving

    It is of course very easy to give a handwaving assertion that a full restore should be tested and tested regularly, but in many cases there is a very practical barrier to that.

    Just where are you going to restore this backup to?

    1. CAPS LOCK

      Re: Handwaving

      A parallel hardware setup, in place as an emergency system. Always have a spare for everything.

      1. Andrew Scott Bronze badge

        Re: Handwaving

        If you can afford it. Not always easy to convince the distributors of the coin that there is a good reason for having backups until the day they lose something and there aren't any working backups.

        1. Terry 6 Silver badge

          Re: Handwaving

          It took many years before my Local Education Authority managers agreed that we actually needed a backup for all the student data we saved to the shared drive/server. I did my best to back stuff up to CD/DVD drives, doing regular and duplicate copies and keeping some off-site, like in my car boot, which rightly wouldn't be allowed these days.

          When they did, it was to a drive in a metal cupboard in an adjoining room. One backup - on site. And that's the best I could get. It was my constant nightmare. Some (potentially all) of that stuff had legal status. But the higher-ups had better things to do with the funds, like regular new furniture for the Director's office.

      2. Paul Hovnanian Silver badge
        Facepalm

        Re: Handwaving

        "in place as an emergency system"

        For load balancing. Two (or more) servers, geographically separated. But each with full copies of all the data.

        We did this for years. The company was divided into two divisions. The North division had their server, the South had theirs. Each in a convenient broom closet local to their principal users. But in the event of a failure, just connect to the opposite division's system and fetch your stuff out of the subdirectory allocated to you. Good luck postulating an event like a meteor that would hit both sites, about 30 miles apart.

        It worked well until the IT godfathers mandated that ALL corporate data and servers were to be housed in their shiny new data center. We relocated both systems there. To a site that was haplessly located a few hundred yards from the Seattle earthquake fault.

        1. Benegesserict Cumbersomberbatch Silver badge

          Re: Handwaving

          Chicxulub was 200km in diameter. You want to plan for maybe 10% of a planet-killer? You might be over-engineering.

      3. the spectacularly refined chap

        Re: Handwaving

        Which is an equally handwaving response.

        "Boss, you know that £40,000 server we have, can we get another two of them please, one for backup and fail over and another we can play with to see if those systems work as intended?"

        "No."

        Where do you go from there? This is the default condition rather than the exception in the real world, so get used to it. You can either brush it under the carpet and pretend it doesn't exist or you can manage it. There are effective mitigation strategies that can be employed to de-risk such situations but you need to acknowledge their existence first.

    2. Anonymous Coward
      Anonymous Coward

      Re: Handwaving

      On my very first job there were three complete systems: the live system, the hot standby system which would fail over in under 30 seconds (failover was tested once a month), and the spare system. The failover system would be updated from the live system on an hourly basis. The spare system was updated from daily backup tapes. The worst case would be that we would lose a day. Every month, when failover was tested, the failover would become the live system. The old live system would be rotated to become the spare. The spare would be rotated to become the failover. We would start a new set of tapes. The live system, the failover, and the spare were in three separate rooms, though they were fairly close to each other so that we could network them, this being the days of thicknet and thinnet, so network speed and distance were factors. Last month's tapes would be placed in a fire-resistant file cabinet in yet another room. The tapes in that cabinet would be sent off-site. It was expensive. It meant that every month IT pulled an all-nighter one weekend to make sure that everything was running properly. It never died in the seven years I was at that job. It wasn't perfect; perfect would have been having the three systems in different locations, but that wasn't happening.

      When we commissioned new hardware to replace the old systems, we put the first new system in the same room as the spare, and used the spare to load our data to it. (The new hardware was considerably smaller than the old hardware while being much faster and holding more data and more RAM.) We then put another new system in the same room as the failover and updated from the failover. And then we put the last in the live system, and when failover time came, we cut in the new hardware. We left the old hardware in place, doing daily updates to the data, for three months, doing a complete rotation of the new hardware, to make sure that everything was working, and then decommissioned the old hardware. By that time we had much faster networking, so we could have moved the physical locations, but that idea was nixed. When last I heard the new hardware had been replaced by newer hardware, never having had a problem.

      Be paranoid. Back everything up. Restore your backups to a test system and check that they work.
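
      That monthly shuffle is just a three-way role rotation; sketched out with invented host names:

      # Sketch of the monthly rotation: failover is promoted to live,
      # the spare becomes the failover, and the old live drops to spare.
      roles = {"live": "hostA", "failover": "hostB", "spare": "hostC"}

      def monthly_rotation(r):
          return {"live": r["failover"], "failover": r["spare"], "spare": r["live"]}

      for month in range(1, 4):
          print(f"month {month}: {roles}")
          roles = monthly_rotation(roles)   # also the point to start a fresh tape set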

  15. Claptrap314 Silver badge
    Boffin

    Disturbingly easy to surprise people

    At my last job, I got handed a significant chunk of our compliance work.

    It was bemusing to receive a questionnaire that assumed that backups were not immediately tested.

    I don't know what you call an untested bunch of bits supposedly written somewhere without being restore-tested, but it ain't a "backup".

  16. Pete 2 Silver badge

    Assume the worst

    * Your multi-tape backup will fail to restore on the final tape

    * when you do restore your data, it will include files that users had deleted _after_ the backup had finished.

    * software race conditions magically appear the "wrong" way round when faster hardware is installed

    * faster hardware just moves the bottleneck, it doesn't fix things permanently

    * everything takes longer the greater the urgency

    * reliability is inversely proportional to the size and importance of the audience

    * everything works perfectly until you close up the box

  17. AGK
    IT Angle

    Yes, this story gets back around to the backup thread.

    I was just listening to a video podcast about Steely Dan and the making of the Gaucho album. It seems they were perfectionists. As such, they created a master (cassette) tape of each track, including one called "Second Arrangement". After days of recording and production, an engineer accidentally stuck the cassette - THE cassette - into a recorder and recorded some test tones in order to tweak the recorder into final alignment. Over the one and only master recording of "Second Arrangement". As the podcast host commented, the good news was that they had 14 seconds of "Second Arrangement"; the bad news was that they had only 14 seconds. Although it was one of their favorite tracks and it was planned to be featured as a single, they decided that track would be dropped from the Gaucho album, and it has never appeared.

    There was NO backup of the cassette because an analog copy of any recording would introduce artifacts, perhaps inaudible in the final track, but artifacts of imprefection (sic - intended).

    A cautionary tale from the analog world that has echoes throughout the digital world. Make backups and test them, dear readers, for anything you care to keep.

  18. Anonymous Coward
    Anonymous Coward

    the tao of backup

    http://taobackup.com/index.html

  19. el_oscuro

    As a DBA, I have had a saying over the last 25 years: "If you haven't practiced your restores recently, you don't have backups". And I have been proven right more times than I can remember.

    I make it a point to incorporate restores into routine operations. Need to clone that database? Don't use the VM cloning utilities. Mount your backup media on the new server - and restore it. If you use a split-mirror technology like EMC, have your backup script open the mirrored copy read-only. Not only do you validate your backups, you can use them as a reporting database. The list goes on and on. Every time you need to make a copy of something, restore it from your backups. You are continuously testing your backups, as well as practicing your own skills.

    So when the shit hits the fan and everything breaks, you won't scramble looking for the restore SOP - because you just did one a few days ago.
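
    A file-level analogue of that habit (not the actual database tooling; the archive location, scratch path and sanity check below are all assumptions) could be as little as:

    # Sketch: every "clone" is a restore of the newest backup archive into a
    # scratch area plus a sanity check, so routine copies double as restore tests.
    import tarfile
    from pathlib import Path

    BACKUP_DIR = Path("/backups")          # assumed nightly .tar.gz archives
    SCRATCH = Path("/srv/restore-test")    # where the clone/reporting copy lives

    newest = max(BACKUP_DIR.glob("*.tar.gz"), key=lambda p: p.stat().st_mtime)
    SCRATCH.mkdir(parents=True, exist_ok=True)

    with tarfile.open(newest) as tar:
        tar.extractall(SCRATCH)            # the actual restore

    # Placeholder sanity check: a real one would query the restored database.
    expected = ["manifest.txt"]
    missing = [name for name in expected if not (SCRATCH / name).exists()]
    if missing:
        raise SystemExit(f"Restore test FAILED, missing: {missing}")
    print(f"Restored {newest.name} into {SCRATCH} - backup verified by use")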

  20. Anonymous Coward
    Anonymous Coward

    NHS

    Many moons ago I went to a user at one trust's head office to find their deleted folder in Outlook had lots of folders and files. I said "What's this? You've been deleting a lot". They said "No, that's where I organise my Outlook e-mails". I said, but it's the Deleted Items folder for a reason: it WILL get deleted.

    Had a head of service who was difficult to work with, but I got used to her. Everyone else avoided her. Helped her out: all her files were on her desktop, so I made a shortcut to the network drive and told her to move them or she'd lose them. She said she would later. So I left. A week later I came back to that site with IT running round like headless chickens. She was kicking off because she had lost a file and they were wondering what to do. I caught her in the kitchen and told her "I told you this would happen if you didn't move them". She admitted her mistake, but it appears only to me. I told the others I'd warned her, so it all calmed down. I still got treated like shit at that place by IT upper management though. I hate being lower down the chain but also don't want to be a manager.

    One engineer at another trust's head office felt abandoned, so he regularly never bothered to put in the backup tapes for that day. So that trust would be without backups for a whole week sometimes. He was lucky nothing ever went pop.

    They brought in a load of temp engineers and treated them like gutter pigs. Explained that if they didn't do a good job, they'd be replaced. This was for a massive roll-out of new PCs. Of course, no one likes being treated like shit or kicked while trying to do a good job. I don't condone it at all, but several of those PCs never turned up at sites. Because those temps realised their audit system was the shittest system known to anyone, and no one knew where kit was once it left the storage unit. Those engineers decided they'd take the pay in PCs instead (again, I don't condone it, but I understand why they did it).

    I like the NHS, but when you work in it, especially the IT department, you realise how many jobsworths are in it, how many overpaid managers are in it, how many overpaid consultants are used in it, and how parts of it are run poorly by IT managers who just want to get bollocks onto their CV.
