I was a part-time DBA. After this failover foul-up, they hired a full-time DBA

No two mistakes are the same, but The Register thinks they're all worth celebrating each Monday when we serve up a fresh edition of Who, Me? – the reader-contributed column in which we share your most magnificent messes, and your means of making it out alive. This week, meet a reader we'll Regomize as "Derek" who told us that …

  1. Korev Silver badge
    Coat

    Yes, Derek should have RTFM. Has failing to do so led you into trouble?

    Well were the Maltese Cross?

    1. KarMann Silver badge

      They weren't just cross, they were falcon' furious.

      1. Philo T Farnsworth Silver badge

        Beat me to it.

        It was a Maltese Foul-up.

        My favorite line from the picture: "Might I remind you Mr. Spade that you may have the falcon, but we certainly have you."

        Second favorite: "The cheaper the crook, the gaudier the patter." . . . which seems to have a certain amount of currency.

        1. vulture65537

          Re: Beat me to it.

          Mrs Farnworth raised children dippy enough to miss the real best line

    2. MachDiamond Silver badge

      "Yes, Derek should have RTFM. Has failing to do so led you into trouble?"

      Hmmmm, PT DBA that likely had a load of other things heaped on his plate would be called out for skiving off if he spent company time reading damn manuals.

      I got spoken to over the time I spent documenting my work at an aerospace company. Never mind the time I can point to where that documentation made the company an even $1mn.

    3. Anonymous Coward
      Anonymous Coward

      .. and here we have the Comment Of The Week.

      Congratulations :)

      (still laughing)

    4. JPCavendish
      Pint

      Outstanding effort. Have one of these on me --->

  2. Korev Silver badge
    Coat

    swift mirroring of any changes made in either location

    Well there's the problem, Taylor isn't known for her DBA skills

    1. Sparkypatrick

      Tay Tay is better known for her security skills.

    2. Just Enough

      "Taylor isn't known for her DBA skills"

      Haters gonna hate, hate, hate, hate, hate.

  3. Anonymous Coward
    Anonymous Coward

    RTFM. Has failing to do so led you into trouble?

    About 50/50; reading the (wildly inaccurate) manual and proceeding on the basis of its assumed accuracy has produced some unexpected consequences.

    Documentation can be out of date; copied from another platform where unique but unredacted features existed; contain features that were to be implemented but never were; or be just plain wrong.

    1. Tanaka

      Re: RTFM. Has failing to do so led you into trouble?

      I'm just impressed his company *had* documentation!

      1. david1024

        Re: RTFM. Has failing to do so led you into trouble?

        Surprised they didn't have a ready reference for the recovery, or maybe a pre-rolled script? But that's the sort of thing a full time admin would do I suppose.

        1. Anonymous Coward
          Anonymous Coward

          Re: RTFM. Has failing to do so led you into trouble?

          Oh you sweet summer child

    2. mhoulden

      Re: RTFM. Has failing to do so led you into trouble?

      It doesn't help that some of the manuals out there are pretty dreadful. Oracle is one example. A single page jumps all over the place and you need to keep your wits about you to make sure you're still reading about the same product.

      1. Anonymous Coward
        Anonymous Coward

        Re: RTFM. Has failing to do so led you into trouble?

        After working for 28 years as an Oracle DBA, the one lesson I learned: ALWAYS ALWAYS have several test rig instances on hand to try out everything you just read before you even go near a dev DB, let alone test or prod!

      2. Alan Brown Silver badge

        Re: RTFM. Has failing to do so led you into trouble?

        The service manual for Harris shortwave transmitters is like that.

        There's a whole step-by-step section on stage alignment which ends with a link to a warning that this should only be attempted if you have a vector analyser at hand. That's a warning that belongs at the very start.

    3. Anonymous Coward
      Anonymous Coward

      Re: RTFM. Has failing to do so led you into trouble?

      At a very large company, sorting out a relatively minor issue with an important, but fortunately dev (not live production) service: I followed** the clear step by step procedure in our documentation*, all worked fine until at the end of the process, were the words "DO NOT FOLLOW THIS ABOVE PROCEDURE AS IT WILL LIKELY BREAK <yet another attached service>, THE CORRECT PROCESS IS BELOW"

      Needless after fixing the unintended issues the fix had caused, I corrected the documentation to have the warning ABOVE the old process. (still needed for historical context)

      *our documentation was using freebie forum software, as the techie powers-that-be had some irrational fear of using a wiki, despite the advantages that could be gained. (or any other intelligent documentation software.)

      **our support staff were extremely busy, usually with production tickets, so spending time trying to do a dev environment fix was given the bare minimum time to shoe-horn the task in.

      -anon as this will be recognised by anyone from that team-

      1. Anonymous Coward
        Anonymous Coward

        Re: RTFM. Has failing to do so led you into trouble?

        Yep, been seriously burned once or twice by this sort of thing, now I make a point to always read right to the end of any in-house docs as people have a tendency to put "DON'T DO THIS NOW. DO THIS INSTEAD AFTER vX.Y ELSE BAD SHIT WILL HAPPEN!" somewhere half-way through the doc!

      2. xeroks

        Re: RTFM. Has failing to do so led you into trouble?

        I call this the "Serve with rice" documentation bug.

        ie you spend an hour following a recipe, all the constituent parts are cooked and at the correct temperature, then the recipe finishes with "serve with rice," an element that will take 10 minutes to make and hasn't been mentioned in the instructions until now.

        1. Stevie Silver badge

          Re: RTFM. Has failing to do so led you into trouble?

          Or the Haynes manual procedure where they omit the vital first step: "first build inspection pit".

          1. cosymart

            Re: RTFM. Has failing to do so led you into trouble?

            Haynes manuals are well known for their glib "remove the 4 screws", omitting to mention that one of the screws is hidden behind something that also requires removing, of which there is no mention!

            1. TooOldForThisSh*t

              Re: RTFM. Has failing to do so led you into trouble?

              Or.. Or..

              As a young man I spent many weekends and dollars driving cheap Italian cars at ridiculous (and dangerous) speeds. I remember well the Haynes and Chilton (sp ?) manuals where a particular repair would start "First remove the engine" or "First disassemble the transmission" with of course no instructions or guide to where to find those needed instructions. Good Times.

            2. Anonymous Coward
              Anonymous Coward

              Re: RTFM. Has failing to do so led you into trouble?

              Or the "installation is the reverse of removal". Well, except for using custom tool ___ to reassemble that 4th part...

      3. Stevie Silver badge

        Re: RTFM. Has failing to do so led you into trouble?

        Cripes! This missing word thing is getting worse!

        Now posters are missing two consecutive words from their posts!

        Perhaps the G5 black helicopters in the covid vaccine are making themselves known!

        AIEE!

      4. The Oncoming Scorn Silver badge
        Facepalm

        Re: RTFM. Has failing to do so led you into trouble?

        I was caught by this sort of thing myself.

        Building a banking server, following step by step instructions & gleefully hit enter, there then followed 2.5 blank pages followed by a stop sign saying Do Not Hit Enter at this point or you will have to reimage!

        I did that at least twice, my excuse was it was usually something like 12.30am in the morning.

        The other one written by someone who gave surprisingly clear instructions if they were writing manuals for AliExpress, unfortunately they weren't. They left in about three paragraphs of steps, bullet points of actions. The next paragraph then followed up with this, please ignore & do not follow the above steps as they are no longer required with this new revision.

      5. Someone Else Silver badge

        Re: RTFM. Has failing to do so led you into trouble?

        Reminds me of one of my favorite M*A*S*H episodes.

        A missile fell into the Hospital compound. It was a US missile and a dud. The hospital team wanted it to be disarmed, but the nearest team that could do it was days away. The doctors figured that, "We are surgeons, how hard could it be"? So with the manual, they started step-by-step disarming the weapon. They opened the hatch, identified the arming wire. With Trapper John reading the manual, and Hawkeye wielding the tools, the key dialog went along these lines:

        Trapper: "Next, cut the red wire."

        Hawkeye: (repeating) "Cut the red wire." <snip>

        Trapper: (flipping the page of the manual) "But first..."

    4. vulture65537

      Re: RTFM. Has failing to do so led you into trouble?

      I remember great shock the first time I found that.

      getopt command line treatment is often far from advertised too

    5. Anonymous Coward
      Anonymous Coward

      Re: RTFM. Has failing to do so led you into trouble?

      I'd say documentation could have been better, as in, the LOCAL should have been on the first page. You shouldn't have to assume there's more to an operation unless implied.

  4. DS999 Silver badge
    Trollface

    Paging Doctor Strange

    The warnings come AFTER the spells!

    1. MisterHappy

      Re: Paging Doctor Strange

      I have had this & it caused a failure & restore from backup. I followed the steps laid out in the suppliers documentation and borked the software.

      A call to their ServiceDesk and then one of their Tech people guided me to one of the appendices which stated that "If running version X, steps 7-9 should be omitted".

      On the plus side, it ticked the box for our monthly "restore a server from backup" test & the update worked perfectly when missing out the unwanted steps of the process.

  5. Sam not the Viking Silver badge

    Blind Spot

    My particular blind spot, if following a list of detailed instructions, is to fail to notice or absorb the second of two operations included in the same line.

    We have a machine-logger where the data can be downloaded by the end-user, but one line of the instructions calls for two actions and the second is often missed, resulting in a call for help. From personal experience, I know how to break the news gently, helping the customer to understand his failing in good humour ..... I have re-written the instructions but they have not been adopted, probably because the message contains more than one operation in the same sentence.

    1. FirstTangoInParis Silver badge

      Re: Blind Spot

      Ah yes. Like ‘microwave $your-dinner for 6 minutes. Half way through …..’ very annoying. I’ve noticed some nicer instructions ’cook for 4 mins, stir, cook for 4 more mins’.

      1. LogicGate Silver badge

        Re: Blind Spot

        My best one was at the very end: Best served with rice.....

        Yes, I could have read ahead...

      2. Potty Professor
        Boffin

        Re: Blind Spot

        I once tried to follow the instructions to cook a "Ten Minute Meal". The very first instruction was to boil the rice for 15 minutes. So how can that and all the other preparation be done inside the advertised ten minute window?

        1. that one in the corner Silver badge

          Re: Blind Spot

          > a "Ten Minute Meal"

          Isn't that simply referring to the time window you have to eat the mess, before it fully converts to an inedible blob on the plate?

  6. ColinPa Silver badge

    You should do this

    I remember getting into trouble when the documentation said "you should do .... " when it meant "you must do...". The word 'should' implies recommendation but not mandatory.

    Fortunately it was only a test system we could easily recreate. My supervisor got the documentation changed.

    1. jake Silver badge

      Re: You should do this

      See https://www.rfc-editor.org/rfc/rfc2119

      1. FirstTangoInParis Silver badge

        Re: You should do this

        > See https://www.rfc-editor.org/rfc/rfc2119

        See also RFC 6919.

      2. Roland6 Silver badge

        Re: You should do this

        I seem to remember the ADA Stoneman requirements document (1980) getting lots of references for its word usage definitions; almost as if this was something new and previously unheard of..

    2. rafff

      Re: You should do this

      Too many people are afraid to use a straight imperative: "do X". This avoidance is generally fine in a social situation, but not in an operational one; the listener/reader needs to know whether he is receiving a request or an order.

      1. Evil Auditor Silver badge

        Re: You should do this

        I'd argue that even in social situations communication often profits from clear wording. Countless times I've heard complaints along the lines of "...they didn't do what I asked them..."

        To which I always respond with: how exactly did you ask/request/demand/order? And all too often we learn that the "request" was a merely hinted, indirect suggestion. (Different cultural backgrounds make it even more interesting.)

    3. Dan 55 Silver badge
      Flame

      Re: You should do this

      gov.uk is full of "you may" and "you might". If anyone is supposed to give clarity, it's them.

      1. HorseflySteve Silver badge

        Re: You should do this

        Legally important words, such as must, may, should, and, or, etc., are to be interpreted in the UK as instructed by the Interpretation Act 1978 as amended. See https://www.legislation.gov.uk/ukpga/1978/30

        The importance of this was illustrated when the EMC directive was incorporated into UK law more or less verbatim. It stated that the CE mark "must appear on the product, packaging or documentation"

        Unfortunately, the Interpretation Act defines "must" as compulsory and "or" as exclusive, so it meant that the CE mark had to appear on one *and only one* of these places. Placing it on more than one was, therefore, illegal in the UK.

        They had to amend the EMC legislation to sort out the mess.

        In case anyone is wondering why the CE mark wouldn't be on the product, there is a legal minimum size and spacing for the mark and some products are just too small, hence the reason for the alternatives.

        1. Dan 55 Silver badge

          Re: You should do this

          Not always legal documents. For example, compare UK with Sweden. The UK's scant advice is liberally peppered with "might"s and "could"s. Sweden's seems like someone's put some thought and effort into it and is clearer for not including so many conditionals - there is only one "might" and it's there for a good reason.

      2. A.P. Veening Silver badge

        Re: You should do this

        If anyone is supposed to MUST give clarity, it's them.

        FTFY

  7. Doctor Syntax Silver badge

    "Yes, Derek should have RTFM. Has failing to do so led you into trouble?"

    The real trouble starts when the manglement read the sales lies blurb and believe it.

    No, take a step back - when the vendor reads their specification and believes it. Works in test is not the same as works in production. Works in production on original platform and works in test on ported platform is not the same as works in production on ported platform.

  8. I Am Spartacus

    Seen it done at a hardware level

    We had VAXen boxes with mirrored disk pairs. There was a disk issue that caused one of the disks to drop out of the mirrored pair. The system carried on as it was supposed to do, no user impact at all. Digital came and replaced the drive with a new one.

    The System Manager then readied himself to bring the new disk into the mirror and have it catch up. Unfortunately he got his source and destination wrong and before I could yell out NOOOO, he mirrored the brand new, empty disk all over the remaining working copy of the production data.

    A real oh sh1t moment.

    1. Will Godfrey Silver badge
      Facepalm

      Re: Seen it done at a hardware level

      Been there. Done that {as I mentioned in a previous comment}

    2. Korev Silver badge
      Coat

      Re: Seen it done at a hardware level

      Did you DEC him for that?

      1. This post has been deleted by its author

        1. Anonymous Coward
          Anonymous Coward

          Re: Seen it done at a hardware level

          Well, sometimes StorageWorks and sometimes it doesn't...

    3. Detective Emil
      Stop

      Re: Seen it done at a hardware level

      After doing that once, you realised that the chunky "read-only" switch with the helpful indicator light was your friend.

      1. C R Mudgeon Silver badge

        Write protection

        I can't say I miss 3.5" floppies, but I certainly miss their write-protect tabs.

        The first USB drive I ever owned had a physical write-protect switch. Alas, I've never seen such a thing again.

        Of course, even if a device has such a switch, the question is whether you can trust it. Does it hard-disable all writes, or is it just a suggestion to the host not to attempt them? (The latter would guard against user error, but not against malware infection, since sophisticated malware could presumably ignore the suggestion.)

    4. Rivalroger
      Alert

      Oooh. I managed to avoid that

      I had lots of SUN boxes with mirrored disks but I never managed to add the wrong one back into disksuite or whatever they rebranded it before selling out to Uncle Larry.

      1. l8gravely

        Re: Oooh. I managed to avoid that

        Funnily enough I just had a Solaris 5.9 box lose a disk, but the onsite tech pulled the working disk by mistake. Oops! I had to get him to move it to another system entirely so I could fsck the damn thing and remove the broken mddev DB entries before it would reboot again. Once it did, all was well. Just took several days since he's only on site sometimes and I'm not a priority all the time. Oh well...

    5. Anonymous Coward
      Happy

      Re: Seen it done at a hardware level

      Probably more rare to find a sysadmin who HASN'T done this at least once in their career!

    6. Anonymous Coward
      Anonymous Coward

      Re: Seen it done at a hardware level

      I've done this, too. `mdadm` was the tool.

      I pulled a disk out that was a little flaky for some reason, and the system kept humming along happily. I zero'd the start and end of the drive (about 50-100MB on each side), to get rid of the metadata blocks for the RAID, and put it back into the system. From there I added it as a disk to the mdadm array, and mdadm started re-sync'ing the data. Great, I thought. Easy.

      Then the filesystem went read-only -- corruption. It turns out that mdadm did not interpret the disk as _new_, it interpreted it as a valid, up-to-date disk (aren't there checksums in the metadata block??) and started syncing the zero'd data over the start of all the disks.

      Sigh. Linux and disks (mdadm and btrfs) have cost so, so much data.

    7. DS999 Silver badge
      Facepalm

      Re: Seen it done at a hardware level

      he mirrored the brand new, empty disk all over the remaining working copy

      There is no reason that should ever happen. Any competently written mirroring software ought to sanity check the source and destination, and if the source has zeroed out blocks where you'd normally see partitioning information/boot blocks/filesystem signature/etc. OR if the destination does, something like "source disk appears empty, are you sure?" or "destination disk appears to contain a valid filesystem, are you sure?" should be the response.
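The sanity check suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the "disks" are just the first 512 bytes of each device passed in as bytes, and a real mirroring tool would need to inspect partition tables and filesystem signatures far more thoroughly than this does.

```python
from typing import List

SECTOR = 512  # assumed sample size: the first sector of each disk


def looks_blank(head: bytes) -> bool:
    """True if every sampled byte is zero (disk appears never written)."""
    return not any(head)


def has_boot_signature(head: bytes) -> bool:
    """True if the first sector ends with the 0x55 0xAA MBR signature."""
    return len(head) >= SECTOR and head[510:512] == b"\x55\xaa"


def mirror_warnings(source_head: bytes, dest_head: bytes) -> List[str]:
    """Return the prompts a cautious tool might show before copying source -> dest."""
    warnings = []
    if looks_blank(source_head):
        warnings.append("source disk appears empty, are you sure?")
    if has_boot_signature(dest_head) or not looks_blank(dest_head):
        warnings.append("destination disk appears to contain data, are you sure?")
    return warnings


# Hypothetical sample data: a zeroed new disk and a disk with a boot signature.
new_disk = bytes(SECTOR)
prod_disk = bytes(510) + b"\x55\xaa"

for line in mirror_warnings(new_disk, prod_disk):
    print(line)
```

Copying in the right direction (`mirror_warnings(prod_disk, new_disk)`) produces no warnings, while the wrong direction trips both checks before any data is touched.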

      1. jake Silver badge

        Re: Seen it done at a hardware level

        "There is no reason that should ever happen."

        In the days when total system RAM was often under 1 megabyte[0], there was no room for today's hand-holding niceties. See, for example, the 1974 Version 5 Unix command dd, which is still with us and doing useful work over half a century later. Back then, we actually trained the staff using the software how it worked, and how not to break things. Shocking, I know ...

        [0] The first mainframe I had control over had 262,144 words of Core ... and I felt very wealthy indeed.

        1. Paul Cooper

          Re: Seen it done at a hardware level

          I first worked on machines with memory measured in kilobytes - even large, powerful (by the standards of the times) ones had maybe a few hundred kilobytes. And disc storage was megabytes at most. We COULDN'T have replicates of systems - there wasn't room!

      2. Evil Auditor Silver badge
        Mushroom

        Re: Seen it done at a hardware level

        "destination disk appears to contain a valid filesystem, are you sure?"

        I fully agree. But then there's I: what does the bloody mirroring tool know! Of course, I'm sure!

      3. C R Mudgeon Silver badge

        Re: Seen it done at a hardware level

        "There is no reason that should ever happen. Any competently written mirroring software ought to..."

        The key word there is "competently".

        See also: "In theory there is no difference between theory and practice, while in practice there is." (Which has been attributed to many people, but Quote Investigator traces it back to one Benjamin Brewster, then a student at Yale, back in 1882.)

    8. phuzz Silver badge
      Facepalm

      Re: Seen it done at a hardware level

      One of our recent HP servers had a problem with a disk in a RAID1 mirror. "No worries", says I, "I'll pop down and stick a fresh disk in, we've got some on the shelf".

      So I let myself into the server room, and found the correct server, which was easy, because it was the newest server in there. Helpfully, the failed disk had a big red light on the caddy, so I popped it out, and swapped in a spare. I went back upstairs to find that the machine was completely unresponsive.

      Long story short, HP in their infinite wisdom, had decided to add a large red light to their disk caddies to indicate which disks should not be removed. Thanks HP.

      So yes, I swapped the only good disk in a RAID with a blank one. (Swapping it back to just the good disk, and later inserting the spare one allowed the server to boot again).

      1. mtp
        Facepalm

        Re: Seen it done at a hardware level

        I used to work with an electron microscope with lots of illuminated buttons but some designer in his infinite craziness decided that the illumination shows which buttons could be pressed at the moment, not the status of what that button controlled.

        So the button for operating mode X came on when it was not in mode X ETC.

        1. Evil Auditor Silver badge

          Re: Seen it done at a hardware level

          Seeing today's UI designs, it appears that usability labs are even rarer nowadays than they were a quarter century ago.

  9. ComicalEngineer Silver badge

    My experience of this was being given the manual for the wrong version of the software, and I'd actually read it!

    Unbeknownst to me, a senior colleague, who sat in an office down the corridor, had the latest version of the manual. It turned out that the software vendor had slightly changed the way that the program handled subroutines and passed variables in and out of said subroutines. We were each working on different parts of the code which had to be joined up at some point to make the simulation work.

    Only when we linked the sections of code it generated data that was obviously erroneous and in some cases just crashed with maths errors.

    Each of our code sections worked fine in isolation and we spent about a week trying to understand what was wrong (the program had somewhat cryptic error reporting).

    Eventually we realised that the problem was in the passing of variable values between my subroutines and my colleague's subroutines.

    The comment from my supervisor was "ah, well, we didn't expect you to be using that functionality [within the subroutines] just yet..." (I was fairly new to the project).

    I was then given a photocopy of the revised manual.

    1. Sam not the Viking Silver badge

      Issue Numbers

      We had a major fallout with a sister company who presented us with a production document: Issue B. When we received the product we found it did not comply. At the ensuing angry meeting, we tabled our version of Issue B only to be confronted with their 'updated' Issue B which contained differences affecting our assembly significantly. Their excuse was that the document in our possession (issued by them, with transmittal notes etc.) had not been 'Officially Issued' by an 'Authorised Member of the Company'. When we asked who had that authority, they mumbled..... a Mr. No-One apparently. It was their cock-up, they knew it, but it was deemed our fault.

      Our relationship with them, never good, became even more strained...... Bunch of wasters. In my opinion. You may find it hard to believe, they're no longer in business.

  10. BartyFartsLast Silver badge

    sewage sorting

    Reminds me of those stunt EULAs and T&Cs where people commit to sorting sewage by hand or win a case of wine for reading them properly

    1. vulture65537

      Re: sewage sorting

      I remember IBM's packaging saying accept the conditions on this CD before you unwrap it.

      1. JPCavendish

        Re: sewage sorting

        Ahhh shrinkwrap contracts. IBM still does it, as does Microsoft with Windows retail packaging. It's been tested in law a few times and the conclusion seems to be that the terms are generally enforceable unless unconscionable, but the consumer/purchaser retains the right to reject the contract in entirety by returning the product. It's not legal to mandate purchase without the opportunity to first read the T&Cs.

        See also: clickwrap contracts for digital products.

      2. mirachu Bronze badge

        Re: sewage sorting

        Not just IBM.

  11. Larry D

    I once did a spec (remember those?) for an outsourcer to fix a performance problem. They announced they had implemented in production. I saw no change. They did not realise it was a 2 page spec so had only done the stuff on page 1.

  12. Anonymous Coward
    Anonymous Coward

    Malta networking

    The one thing you don't have to worry about much is the connectivity in Malta. I had a 1Gb/s home connection there when the UK still considered a 100Mb/s link top of the line.

    Their computer economy is a mix of gambling and gaming, and as ever when real money is involved there is decent infrastructure..

    1. Roland6 Silver badge

      Re: Malta networking

      My initial thought on the first few paragraphs of the article was whether the international cable was of sufficient capacity to handle both the failover traffic and the resync traffic…

  13. Combat Epistomologist

    Bad idea

    The insufficiently understood issue I'm seeing not mentioned here is that bidirectional "multi-master" MySQL native streaming replication is inherently unsafe in the first place, and can result in silent data corruption. Especially when that replication is over a high-latency link. To briefly summarize, the problem occurs when two "masters" update the same record at close enough to the same time that the updates pass each other in flight, and when all is done each database ends up with the update written by the other node.

    I worked with a marketing company that insisted on doing this and CONSTANTLY ran into data errors where the two halves of their database cluster disagreed. Their solution was to constantly manually monitor for mismatches, decide which version of the truth was correct, and fire off corrections using Percona database tools.

    As I'm sure you can imagine, this didn't work well. It was a continuous dumpster fire.

    By contrast, for another customer I built a database solution with five primary nodes on four continents using Galera synchronous replication, with haproxy managing fail-over. And THAT solution worked. Yes, sometimes it was slow for customers in one region when their local node was down or rebuilding. But it worked, and kept on working, and didn't corrupt its own data.
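The crossing-updates failure described in this comment can be sketched with a toy model. This is not MySQL: `Node`, `local_update` and `exchange` are hypothetical names invented for illustration, and the only assumption is that each node applies the other's replicated change blindly, with no conflict detection.

```python
class Node:
    """A toy 'master' that replays replicated changes without checking for conflicts."""

    def __init__(self, name):
        self.name = name
        self.rows = {}    # key -> current value on this node
        self.outbox = []  # changes waiting to be shipped to the peer

    def local_update(self, key, value):
        # Apply the write locally, then queue it for replication.
        self.rows[key] = value
        self.outbox.append((key, value))

    def apply_replicated(self, events):
        # Blind replay: the last change to arrive wins on this node.
        for key, value in events:
            self.rows[key] = value


def exchange(a, b):
    """Deliver each node's queued changes to the other, as if they crossed in flight."""
    a_events, b_events = a.outbox, b.outbox
    a.outbox, b.outbox = [], []
    b.apply_replicated(a_events)
    a.apply_replicated(b_events)


node_a = Node("node_a")
node_b = Node("node_b")

# Both nodes update the same record before either has seen the other's change.
node_a.local_update("customer:42", "written by node_a")
node_b.local_update("customer:42", "written by node_b")
exchange(node_a, node_b)

print(node_a.rows["customer:42"])
print(node_b.rows["customer:42"])
```

After the exchange each node holds the value written by the other, so the two copies disagree and neither side reports an error.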

    1. Anonymous Coward
      Anonymous Coward

      Re: Bad idea

      Good memories of Galera.

      Client had 5 5-node (production) clusters across 3 machine rooms (2+2 + arbitrator in 3rd) and not a single data corruption in the 5 years I was maintaining those clusters. HAproxy as front end.

      Machine room operator had a funny habit of having ~1s long breaks in traffic between machine rooms, and the clusters didn't like that at all: a long list of error log entries every time. Apparently they didn't bother to fix it, as it continued for several years.

      But users didn't notice and data didn't break, so changing master and syncing afterwards was working as expected. That's unfortunately not the default.
