Yes, Derek should have RTFM. Has failing to do so led you into trouble?
Well were the Maltese Cross?
No two mistakes are the same, but The Register thinks they're all worth celebrating each Monday when we serve up a fresh edition of Who, Me? – the reader-contributed column in which we share your most magnificent messes, and your means of making it out alive. This week, meet a reader we'll Regomize as "Derek" who told us that …
"Yes, Derek should have RTFM. Has failing to do so led you into trouble?"
Hmmmm, a PT DBA who likely had a load of other things heaped on his plate would be called out for skiving off if he spent company time reading damn manuals.
I got spoken to over the time I spent documenting my work at an aerospace company. Never mind the time I can point to where that documentation made the company an even $1mn.
About 50/50; reading the (wildly inaccurate) manual and proceeding on the basis of its assumed accuracy has produced some unexpected consequences.
Documentation can be out of date; copied from another platform where unique but unredacted features existed; contain features that were to be implemented but never were; or be just plain wrong.
The service manual for Harris shortwave transmitters is like that.
There's a whole step-by-step section on stage alignment which ends with a link to a warning that this should only be attempted if you have a vector analyser at hand. That's a warning that belongs at the very start.
At a very large company, sorting out a relatively minor issue with an important, but fortunately dev (not live production) service: I followed** the clear step by step procedure in our documentation*, and all worked fine until, at the end of the process, were the words "DO NOT FOLLOW THIS ABOVE PROCEDURE AS IT WILL LIKELY BREAK <yet another attached service>, THE CORRECT PROCESS IS BELOW"
Needless to say, after fixing the unintended issues the fix had caused, I corrected the documentation to have the warning ABOVE the old process. (still needed for historical context)
*our documentation was using freebie forum software, as the techie powers-that-be had some irrational fear of using a wiki, despite the advantages that could be gained. (or any other intelligent documentation software.)
**our support staff were extremely busy, usually with production tickets, so spending time trying to do a dev environment fix was given the bare minimum time to shoe-horn the task in.
-anon as this will be recognised by anyone from that team-
Yep, been seriously burned once or twice by this sort of thing, now I make a point to always read right to the end of any in-house docs as people have a tendency to put "DON'T DO THIS NOW. DO THIS INSTEAD AFTER vX.Y ELSE BAD SHIT WILL HAPPEN!" somewhere half-way through the doc!
I call this the "Serve with rice" documentation bug.
ie you spend an hour following a recipe, all the constituent parts are cooked and at the correct temperature, then the recipe finishes with "serve with rice," an element that will take 10 minutes to make and hasn't been mentioned in the instructions until now.
Or.. Or..
As a young man I spent many weekends and dollars driving cheap Italian cars at ridiculous (and dangerous) speeds. I remember well the Haynes and Chilton (sp ?) manuals where a particular repair would start "First remove the engine" or "First disassemble the transmission" with of course no instructions or guide to where to find those needed instructions. Good Times.
I was caught by this sort of thing myself.
Building a banking server, following step by step instructions & gleefully hit enter, there then followed 2.5 blank pages followed by a stop sign saying Do Not Hit Enter at this point or you will have to reimage!
I did that at least twice, my excuse was it was usually something like 12.30am in the morning.
The other one was written by someone whose instructions would have been surprisingly clear if they were writing manuals for AliExpress; unfortunately they weren't. They left in about three paragraphs of steps and bullet points of actions. The next paragraph then followed up with this: please ignore & do not follow the above steps as they are no longer required with this new revision.
Reminds me of one of my favorite M*A*S*H episodes.
A missile fell into the Hospital compound. It was a US missile and a dud. The hospital team wanted it to be disarmed, but the nearest team that could do it was days away. The doctors figured that, "We are surgeons, how hard could it be"? So with the manual, they started step-by-step disarming the weapon. They opened the hatch, identified the arming wire. With Trapper John reading the manual, and Hawkeye wielding the tools, the key dialog went along these lines:
Trapper: "Next, cut the red wire."
Hawkeye: (repeating) "Cut the red wire." <snip>
Trapper: (flipping the page of the manual) "But first..."
I have had this & it caused a failure & restore from backup. I followed the steps laid out in the supplier's documentation and borked the software.
A call to their ServiceDesk and then one of their Tech people guided me to one of the appendices which stated that "If running version X, steps 7-9 should be omitted".
On the plus side, it ticked the box for our monthly "restore a server from backup" test & the update worked perfectly when missing out the unwanted steps of the process.
My particular blind spot, if following a list of detailed instructions, is to fail to notice or absorb the second of two operations included in the same line.
We have a machine-logger where the data can be downloaded by the end-user, but one line of the instructions calls for two actions and the second is often missed, resulting in a call for help. From personal experience, I know how to break the news gently, helping the customer to understand his failing in good humour ..... I have re-written the instructions but they have not been adopted, probably because the message contains more than one operation in the same sentence.
I remember getting into trouble when the documentation said "you should do .... " when it meant "you must do...". The word 'should' implies a recommendation, not a mandatory step.
Fortunately it was only a test system we could easily recreate. My supervisor got the documentation changed.
I'd argue that even in social situations communication often profits from clear wording. Countless times I've heard complaints along the lines of "...they didn't do what I asked them..."
To which I always respond with: how exactly did you ask/request/demand/order? And all too often we learn that the "request" was merely a hinted, indirect suggestion. (Different cultural backgrounds make it even more interesting.)
Legally important words, such as must, may, should, and, or, etc., are to be interpreted in the UK as instructed by the Interpretation Act 1978 as amended. See https://www.legislation.gov.uk/ukpga/1978/30
The importance of this was illustrated when the EMC directive was incorporated into UK law more or less verbatim. It stated that the CE mark "must appear on the product, packaging or documentation"
Unfortunately, the Interpretation Act defines "must" as compulsory and "or" as exclusive, so it meant that the CE mark had to appear on one *and only one* of these places. Placing it on more than one was, therefore, illegal in the UK.
They had to amend the EMC legislation to sort out the mess.
In case anyone is wondering why the CE mark wouldn't be on the product, there is a legal minimum size and spacing for the mark and some products are just too small, hence the reason for the alternatives.
Not always legal documents. For example, compare UK with Sweden. The UK's scant advice is liberally peppered with "might"s and "could"s. Sweden's seems like someone's put some thought and effort into it and is clearer for not including so many conditionals - there is only one "might" and it's there for a good reason.
"Yes, Derek should have RTFM. Has failing to do so led you into trouble?"
The real trouble starts when the manglement read the sales lies blurb and believe it.
No, take a step back - when the vendor reads their specification and believes it. Works in test is not the same as works in production. Works in production on original platform and works in test on ported platform is not the same as works in production on ported platform.
We had VAXen with mirrored disk pairs. There was a disk issue that caused one of the disks to drop out of the mirrored pair. The system carried on as it was supposed to do, no user impact at all. Digital came and replaced the drive with a new one.
The System Manager then readied himself to bring the new disk into the mirror and have it catch up. Unfortunately he got his source and destination wrong and before I could yell out NOOOO, he mirrored the brand new, empty disk all over the remaining working copy of the production data.
A real oh sh1t moment.
I can't say I miss 3.5" floppies, but I certainly miss their write-protect tabs.
The first USB drive I ever owned had a physical write-protect switch. Alas, I've never seen such a thing again.
Of course, even if a device has such a switch, the question is whether you can trust it. Does it hard-disable all writes, or is it just a suggestion to the host not to attempt them? (The latter would guard against user error, but not against malware infection, since sophisticated malware could presumably ignore the suggestion.)
Funnily enough I just had a Solaris 5.9 box lose a disk, but the onsite tech pulled the working disk by mistake. Oops! I had to get him to move it to another system entirely so I could fsck the damn thing and remove the broken mddev DB entries before it would reboot again. Once it did, all was well. Just took several days since he's only on site sometimes and I'm not a priority all the time. Oh well...
I've done this, too. `mdadm` was the tool.
I pulled a disk out that was a little flaky for some reason, and the system kept humming along happily. I zero'd the start and end of the drive (about 50-100MB on each side), to get rid of the metadata blocks for the RAID, and put it back into the system. From there I added it as a disk to the mdadm array, and mdadm started re-sync'ing the data. Great, I thought. Easy.
Then the filesystem went read-only -- corruption. It turns out that mdadm did not interpret the disk as _new_, it interpreted it as a valid, up-to-date disk (aren't there checksums in the metadata block??) and started syncing the zero'd data over the start of all the disks.
Sigh. Linux and disks (mdadm and btrfs) have cost so, so much data.
"he mirrored the brand new, empty disk all over the remaining working copy"
There is no reason that should ever happen. Any competently written mirroring software ought to sanity check the source and destination, and if the source has zeroed out blocks where you'd normally see partitioning information/boot blocks/filesystem signature/etc. OR if the destination does, something like "source disk appears empty, are you sure?" or "destination disk appears to contain a valid filesystem, are you sure?" should be the response.
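A minimal sketch of the kind of check being asked for, in Python. It is an illustration only, not how any real mirroring tool works: the function names and the probe size are made up, and it uses a crude "is the start of the disk all zeroes?" test rather than proper partition table or filesystem signature detection.

```python
import io

SECTOR = 512
PROBE_BYTES = 64 * SECTOR  # hypothetical probe size: the first 32 KiB


def looks_empty(dev, probe_bytes=PROBE_BYTES):
    """Return True if the first probe_bytes of a seekable binary stream are all zero."""
    dev.seek(0)
    head = dev.read(probe_bytes)
    return not any(head)  # any() is False when every byte is 0


def check_mirror_direction(source, destination):
    """Return the warnings a tool could show before copying source over destination."""
    warnings = []
    if looks_empty(source):
        warnings.append("source disk appears empty, are you sure?")
    if not looks_empty(destination):
        warnings.append("destination disk appears to contain data, are you sure?")
    return warnings
```

Called with two in-memory "disks" (io.BytesIO objects, or device files opened in binary mode), it returns an empty list when a populated disk is the source and a blank one is the destination, and both warnings when the two are swapped.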
"There is no reason that should ever happen."
In the days when total system RAM was often under 1 megabyte[0], there was no room for today's hand-holding niceties. See, for example, the 1974 Version 5 Unix command dd, which is still with us and doing useful work over half a century later. Back then, we actually trained the staff using the software how it worked, and how not to break things. Shocking, I know ...
[0] The first mainframe I had control over had 262,144 words of Core ... and I felt very wealthy indeed.
I first worked on machines with memory measured in kilobytes - even large, powerful (by the standards of the times) ones had maybe a few hundred kilobytes. And disc storage was megabytes at most. We COULDN'T have replicates of systems - there wasn't room!
"There is no reason that should ever happen. Any competently written mirroring software ought to..."
The key word there is "competently".
See also: "In theory there is no difference between theory and practice, while in practice there is." (Which has been attributed to many people, but Quote Investigator traces it back to one Benjamin Brewster, then a student at Yale, back in 1882.)
One of our recent HP servers had a problem with a disk in a RAID1 mirror. "No worries", says I, "I'll pop down and stick a fresh disk in, we've got some on the shelf".
So I let myself into the server room, and found the correct server, which was easy, because it was the newest server in there. Helpfully, the failed disk had a big red light on the caddy, so I popped it out, and swapped in a spare. I went back upstairs to find that the machine was completely unresponsive.
Long story short, HP in their infinite wisdom, had decided to add a large red light to their disk caddies to indicate which disks should not be removed. Thanks HP.
So yes, I swapped the only good disk in a RAID with a blank one. (Swapping it back to just the good disk, and later inserting the spare one allowed the server to boot again).
I used to work with an electron microscope with lots of illuminated buttons, but some designer in his infinite craziness decided that the illumination shows which buttons could be pressed at the moment, not the status of what that button controlled.
So the button for operating mode X came on when it was not in mode X, etc.
My experience of this was being given the manual for the wrong version of the software, and I'd actually read it!
Unbeknownst to me, a senior colleague, who sat in an office down the corridor, had the latest version of the manual. It turned out that the software vendor had slightly changed the way that the program handled subroutines and passed variables in and out of said subroutines. We were each working on different parts of the code which had to be joined up at some point to make the simulation work.
Only when we linked the sections of code it generated data that was obviously erroneous and in some cases just crashed with maths errors.
Each of our code sections worked fine in isolation and we spent about a week trying to understand what was wrong (the program had somewhat cryptic error reporting).
Eventually we realised that the problem was in the passing of variable values between my subroutines and my colleague's subroutines.
The comment from my supervisor was "ah, well, we didn't expect you to be using that functionality [within the subroutines] just yet..." (I was fairly new to the project).
I was then given a photocopy of the revised manual.
We had a major fallout with a sister company who presented us with a production document: Issue B. When we received the product we found it did not comply. At the ensuing angry meeting, we tabled our version of Issue B only to be confronted with their 'updated' Issue B which contained differences affecting our assembly significantly. Their excuse was that the document in our possession (issued by them, with transmittal notes etc.) had not been 'Officially Issued' by an 'Authorised Member of the Company'. When we asked who had that authority, they mumbled..... a Mr. No-One apparently. It was their cock-up, they knew it, but it was deemed our fault.
Our relationship with them, never good, became even more strained...... Bunch of wasters. In my opinion. You may find it hard to believe, they're no longer in business.
Ahhh shrinkwrap contracts. IBM still does it, as does Microsoft with Windows retail packaging. It's been tested in law a few times and the conclusion seems to be that the terms are generally enforceable unless unconscionable, but the consumer/purchaser retains the right to reject the contract in entirety by returning the product. It's not legal to mandate purchase without the opportunity to first read the T&Cs.
See also: clickwrap contracts for digital products.
The one thing you don't have to worry about much is the connectivity in Malta. I had a 1Gb/s home connection there when the UK still considered a 100Mb/s link top of the line.
Their computer economy is a mix of gambling and gaming, and as ever when real money is involved there is decent infrastructure.
The insufficiently understood issue I'm seeing not mentioned here is that bidirectional "multi-master" MySQL native streaming replication is inherently unsafe in the first place, and can result in silent data corruption. Especially when that replication is over a high-latency link. To briefly summarize, the problem occurs when two "masters" update the same record at close enough to the same time that the updates pass each other in flight, and when all is done each database ends up with the update written by the other node.
I worked with a marketing company that insisted on doing this and CONSTANTLY ran into data errors where the two halves of their database cluster disagreed. Their solution was to constantly manually monitor for mismatches, decide which version of the truth was correct, and fire off corrections using Percona database tools.
As I'm sure you can imagine, this didn't work well. It was a continuous dumpster fire.
By contrast, for another customer I built a database solution with five primary nodes on four continents using Galera synchronous replication, with haproxy managing fail-over. And THAT solution worked. Yes, sometimes it was slow for customers in one region when their local node was down or rebuilding. But it worked, and kept on working, and didn't corrupt its own data.
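The in-flight race described above can be sketched with a toy model in Python. This is not MySQL's replication protocol, just two hypothetical nodes that apply each other's queued updates with no conflict detection; every name in it is invented for illustration.

```python
class Node:
    """A toy 'master': a dict of rows plus a queue of updates not yet sent to the peer."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.outbox = []

    def local_write(self, key, value):
        # A client writes to this node; the change is queued for the peer.
        self.rows[key] = value
        self.outbox.append((key, value))

    def apply_from_peer(self, updates):
        # Replicated changes are applied blindly and are not re-queued.
        for key, value in updates:
            self.rows[key] = value


def exchange(a, b):
    """Deliver each node's queued updates to the other, as if they crossed in flight."""
    from_a, from_b = a.outbox, b.outbox
    a.outbox, b.outbox = [], []
    b.apply_from_peer(from_a)
    a.apply_from_peer(from_b)


def demo():
    a, b = Node("A"), Node("B")
    # Both sides update the same record before either has heard from the other.
    a.local_write("customer:42", "email=a@example.com")
    b.local_write("customer:42", "email=b@example.com")
    exchange(a, b)
    return a.rows["customer:42"], b.rows["customer:42"]
```

Here demo() returns ("email=b@example.com", "email=a@example.com"): each node ends up holding the other's write, neither reports an error, and the two copies silently disagree.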
Good memories of Galera.
Client had 5 5-node (production) clusters across 3 machine rooms (2+2 + arbitrator in 3rd) and not a single data corruption in the 5 years I was maintaining those clusters. HAproxy as front end.
Machine room operator had a funny habit of having ~1s long breaks in traffic between machine rooms, and the clusters didn't like that at all: a long list of errors in the log every time. Apparently they didn't bother to fix it, as it continued for several years.
But users didn't notice and data didn't break, so changing master and syncing afterwards was working as expected. That's unfortunately not the default.