
I'll be borrowing this!
"There was a loose nut on the keyboard."
It will surely slow the torrent of questions next time there's a slow down somewhere!
A fresh week means a fresh story to add to The Register's regular hall of shame where hapless techies tell tales of in-the-field slip-ups: welcome to Who, Me? Today's near-calamity comes courtesy of a reader we shall refer to as "Dan", who was working on developing a card payments platform for "a well-known telco".
Not me but a colleague (honest!) who was trying to do a change but had been dragged into an incident on another server. Then it came to the point where he was supposed to reboot the server he was doing the change on... and of course typed it in the wrong window. He 'fessed up quickly enough and is still working there, many years later, despite bringing down the production server.
Hell, I did that last week ... on a customer system.
Active/Passive pair of systems at two data centres.
Each is individually HA clustered - so there's some hint as to how critical they're considered.
Working on the passive side, set up a couple of monitoring windows - open a third terminal and bring the local cluster down...
Nothing shows on the monitoring sessions.
Customer was logged into the passive DC via the active one, and we'd accidentally ended up with a session on the wrong data centre, and brought down the active one (whilst the passive one was undergoing maintenance).
You might be interested in molly-guard, which is designed to stop exactly that, by forcing you to type the hostname of the machine you want to reboot if you appear to be using SSH to run (eg) shutdown -r.
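The idea is simple enough to sketch in a few lines of shell - this is just an illustration of the concept, not molly-guard's actual implementation (which installs proper wrappers around shutdown, reboot, halt and poweroff):

  #!/bin/sh
  # Illustration only - not molly-guard itself. Wraps a reboot so that,
  # over SSH, you must type this machine's hostname before it proceeds.
  if [ -n "$SSH_CONNECTION" ]; then
      printf 'SSH session detected. Type this host'\''s name to reboot: '
      read -r answer
      if [ "$answer" != "$(hostname -s)" ]; then
          echo "Good thing I asked; not rebooting $(hostname -s)." >&2
          exit 1
      fi
  fi
  exec /sbin/reboot "$@"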
The jargon file entry for "Molly guard" is worth a read too :)
Running commands whilst connected to the wrong system/database is a rite of passage for anyone in the IT industry. We've all done it at some point in our careers and anyone who says they haven't is either lying or a PFY who has yet to experience that bowel-liquefying moment that strikes almost immediately after pressing the ENTER key.
What is more interesting is how you recover from the situation and did anyone actually notice?
>>Why do people do otherwise?
Because those with a long Unix memory remember when halt and reboot did not call shutdown if not run from shutdown, but instantly halted or rebooted the system. There's always a nagging feeling that this system *might*, *just possibly* be the last holdout of some old design or traditionalist BOFH who has these commands do what they used to do, back in the good old days. Relying on the safety mechanism built into recent versions of halt feels like pulling the trigger of a gun pointed at your foot and trusting that the safety catch is on...
As an aside, I still remember being taught that if you *really* had to crash-restart a system (shutdown wasn't working) and still had the console, the correct command was
sync; sync; sync; reboot
To give maximum chance that the filesystem would be in a consistent state, or at least repairable on recovery. It was kind of a magic incantation that you hoped you'd never have to actually use, and the extra syncs probably do nothing...
I seem to recall that the commands should be issued separately, to allow the syncs a chance of working while you were typing.
Exactly. With old, slow MFM drives of the ~ BSD 4.2 era, the idea was that you'd enter "sync", and by the time the shell was able to show the command prompt again, the flush would be well under way. Repeating the process twice more gave you high confidence that it was able to complete, because if there were a lot of pages to flush, the system would be correspondingly sluggish (particularly if the sticky bit was not set on /bin/sync) and so it would take longer to get the prompt and run sync again.
"sync; sync; sync; reboot" was a cargo-cult corruption of that process which lost the purpose of the multiple syncs. Yes, running sync the second and third time would give the kernel a little more time to finish flushing, but not nearly as much as entering them as separate command lines, and it wouldn't adapt to the system load.
As an AS/400 programmer with access to live systems, I do colour code them. Sessions on everything except development machines I set up with a non-standard colour background, production is purple (red hurts my eyes after a while), acceptance is turquoise and dedicated testing machines get a green background. Several colleagues use something similar, most after an OOOOOOPPPPSSS moment.
Years ago I saw that a colleague’s desktop background was red, I asked him if it was hard on the eyes and he told me that he was logged into an admin account - His own user account desktop was a nice restful grey. I stole the idea, and have used it ever since.
I seem to remember using a distro years ago that had a red desktop background with bombs on it when logged in as root...
I remember that as well! Had completely forgotten it. Mandrake or an early SuSE? Or the for-pay one that had the not-too-bad (for the times) "click and run" package manager, first time you clicked something it would install it if it wasn't already there?
I do remember the red background with the bombs. Kinda reminded me of Win 3.11's minesweeper...
Paris coz... I'll bet she has trouble remembering stuff too.
I used to have this at a previous job.
Terminals would be coloured according to the priority of the system. Just gentle background hues, not enough to detract from the 'green on black' aesthetic, but enough to give you a bit of confidence you were on the right system.
Prod and Pre Prod, Live and DR - so all four had different colours.
The test labs were just black...
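For xterm-ish terminals you can get that effect with no special software at all - a sketch, assuming a terminal that honours the OSC 11 background-colour escape, dropped into ~/.bashrc (hostname patterns and colours invented):

  # tint the terminal background by hostname
  case "$(hostname -s)" in
    prod*|live*) printf '\033]11;#400010\007' ;;  # dark red for production
    dr*)         printf '\033]11;#403000\007' ;;  # amber for DR
    preprod*)    printf '\033]11;#003040\007' ;;  # blue-green for pre-prod
    *)           printf '\033]11;#000000\007' ;;  # test labs stay black
  esac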
Same here, but live doesn't get a restful background hue - it's bright red. Something to give you that adrenaline rush before you hit the enter key, not after.
I'm also thankful that MobaXterm notices you're trying to paste more than one line and pops up a dialogue box asking confirmation instead of e.g. obediently pasting 20 lines which could be deadly.
It’s the wild colour scheme that freaks me. I mean, when you try an’ operate one of these weird black controls which are labelled in black on a black background, a small black light lights up black to tell you you’ve done it. What is this? Some kind of intergalactic hyper-hearse?
So you're conditioned to look at colours rather than developing a situational awareness?
- How will molly-guard save you if you don't know what workload the machine is actually running?
- Oh, colours - When the cluster automatically fails over, does it also automatically update the colour?
- And of course every PM always follows process to promote systems into Production
>>So you're conditioned to look at colours rather than developing a situational awareness?
The colours are part of the "situational awareness".
>>- How will molly-guard save you if you don't know what workload the machine is actually running?
When you type in eg "sudo shutdown -r now" while working under SSH, MG (which I only just learnt about and already have installed on a couple of servers) asks you to type in the hostname of the machine. If you're on the wrong machine then you'll type in the wrong hostname.
This happens when people who actually do technical work on multiple servers at a time (or on a local machine and a remote machine, or ssh into one box and then from there into others - eg ssh into a gateway at a remote office, then ssh into various servers in that office) forget which window they're typing into, which can happen when you're genuinely working on multiple things at once.
Some day you may be experienced enough to understand this.
(Yes, I've taken down machines I didn't mean to by either being in the wrong window of several, or by having backed out of/been dropped out of a 2nd or 3rd level machine and not seeing I'm not where I expected to be).
Dear Kiwi - The fact that you don't understand my questions and, by your own admission, have rebooted the wrong server on more than one occasion suggests that I'm more experienced than you :-) I don't trust that someone has marked a server correctly, that its use hasn't changed or that the workload hasn't failed over to DR - `uname -a` and `ps -ef` are your real friends.
I did it once, 21 years ago. I learned from it. You know that "Oh no" moment? For me, it now comes before I hit "Enter". Since then, I've worked for multiple large companies as a sysadm/engineer, supporting thousands of mission critical servers in complex environments.
If you work in a "perfect" environment where everyone follows process and colours systems according to workload, I'm glad for you.
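For what it's worth, that pre-flight check costs about five seconds (a sketch - add whatever suits your estate):

  hostname; uname -a    # which box am I actually typing at?
  ps -ef | head -25     # what workload is really running here?
  uptime                # a supposedly passive node shouldn't be busy
  who -b                # when did it last boot?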
>>Dear Kiwi - The fact that you don't understand my questions and, by your own admission, have rebooted the wrong server on more than one occasion suggests that I'm more experienced than you
Yeah um. No.
Actually I haven't rebooted the wrong server more than once and never said I did. I did say "taken down" which has more than one way of being interpreted. One was rebooting, the other was killing a network subsystem, both when I was working far more hours/week than is healthy (the sort of workloads that leave you hallucinating).
I know a few people who claim to have never been involved in a car accident. I note those same people also tend to use public transport and only drive down to the corner shop every second Sunday that falls after the 20th of the month. Your claims sound to me much like their driving experience.
B Franklin is reputed to have said something like "The man who does things makes many mistakes". He or someone else said "He who makes no mistakes makes nothing." If you claim not to make mistakes, well...
(When I did it, it was only 6 or 7 years ago, but I had been at work for somewhere over 32 hours already)
Boo!
In more articulated terms, the notion that "if you're paying attention, it'll not happen" is only accurate a set percentage of the time.
Every little thing that can be done to improve the chance of bringing that percentage up is worth doing (but it'll never be 100).
Lightening up a bit: around the place I call home there's a joke that goes "only those who don't work make no mistakes... and the mistake-free deserve a promotion". Which begs the question: are you a manager?
Argh. Please don't watermark documents with 'draft'. The first thing I'll do on receipt is remove the watermark to make them readable anyway.
At which point I'll be sharing the readable version and people used to your scheme will think it's your final version.
No, you do NOT colour-code safe versus dangerous in the two colours most likely not to be recognised by the most common form of male colour blindness.
Yellow-orange versus blue-green is a much better pair of colours; at least then the colour blind have a sporting chance of not cocking things up (and, moreover, much less excuse).
"how you recover from the situation"
Use a DB that allows transactions, and always start a new transaction before any potentially dodgy commands - then you can just roll back.
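Something like this, as a sketch (psql syntax; database, table and predicate invented for illustration):

  psql mydb <<'SQL'
  BEGIN;
  DELETE FROM customer_emails WHERE customer_status = 'ceased';
  -- psql prints "DELETE <n>" above; sanity-check what's left too:
  SELECT count(*) FROM customer_emails;
  ROLLBACK;  -- swap for COMMIT once the dry run looks right
  SQL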
Until the one day you're in a hurry and forget to start a new transaction, and end up rolling back waaaaaay earlier than you intended.
I once managed to delete the only copy of the 40-day churn file of a particularly dubious (and now defunct) ISP whilst I was working there. Fortunately we had plenty of backups of the Radius logs, so these were run through the Perl script which generated the 40-day churn file, whilst I lurked somewhere else in deep disgrace.
I have dealt with a few that would not have allowed this recovery and would have resulted in complete loss...
We have all done "something" on the wrong system though, myself included, although my turn was luckily entirely transparent to service...
Only after clearing out the data from an "evaluation" system did I learn that the evaluation had turned into full production. No matter, the system was mirrored to our DR site overnight using rsync, so all that was needed was to grab the cron script, swap source and destination in the rsync command, and all would be well. Unfortunately either the swap didn't happen or maybe it was done twice, but the end result was that the one and only backup (this *was* for evaluation, right?) was overwritten.
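For anyone who hasn't been bitten: rsync happily mirrors in whichever direction you tell it, and if the cron job mirrored with --delete then the "backup" gets emptied too. A sketch, hosts and paths invented:

  # the nightly cron job: push the eval box to the DR mirror
  rsync -a --delete /srv/eval/ drhost:/backup/eval/

  # the intended recovery: the same command with source and destination swapped
  rsync -a --delete drhost:/backup/eval/ /srv/eval/

  # run the first form by mistake after clearing /srv/eval, and --delete
  # dutifully empties the one remaining copy on drhost as well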
Mine was deleting the whole customer mailing list (complete with the 'do not spam' check) after getting over-eager with the enter key before finishing the 'where' clause. (I was removing email addresses for ceased-trading customers, which hadn't been done... well, ever, as far as I could tell, so a lot of unnecessary emails got sent.)
They never did figure out why the database slowed down that day whilst I tried to undo said change (also, note to self: always write commands that can be easily rolled back).
I learned long ago that whenever doing anything anywhere near live data, the first thing you do is
SELECT * INTO TABLE_BACKUP_DATE FROM TABLE
(with the caveat of: beware of cascading deletes on foreign keys)
That, on top of a full database backup if you can afford one, and writing any delete/update statements as select statements first, to do a check that they return the expected number of records.
I've got a few processes that start with a count based on the criteria and a rough estimate on what it 'should' be.
"40,000 rows will be deleted (expected: 400/day). Confirm?" is a great warning. Maybe that day did get accidentally added to the database 100 times, but...
Yup, and when the database tablespace/datafile/whatever runs out of space you realise you're trying to back up the ginormous main production table... while running a huge table scan that may have a certain number of adverse effects (like ejecting everything else from caches, creating locks, etc. etc.)
Jokes aside, anything involving live data needs to be carefully planned taking into account how large the database is, what resources are available, the impact on running applications, and a "disaster recovery" plan in place if something goes wrong.
Buckle up cowboy, that's not how we do things around here...
"anything involving live data needs to be carefully planned"
That's practically change control. We already told them this was a non-impacting change - we don't want to draw unnecessary attention to this.
"taking into account how large the database is"
those storage guys never give us what we want. It's their problem...
"what resources are available, the impact on running applications, and"
The football starts in 30 minutes and I've already got a pint waiting for me
"a "disaster recovery" plan in place if something goes wrong."
The project manager has spent the last two months drawing up a list of potential victims^H^H^H candidates for the blame game.
I have always wondered when MS SQL (and some others) will require a WHERE clause on any sort of update or delete statement, instead of defaulting to ALL. I have worked with databases, some literally decades old, that require either a 'WHERE TRUE' or 'WHERE 1=1' qualifier to be syntactically correct. That made this sort of error far less likely.
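MySQL (not MS SQL, sadly) has had an opt-in version of this guard for years - a sketch:

  # client flag; the long-standing alias really is --i-am-a-dummy
  mysql --safe-updates mydb

  # or per-session, in SQL:
  #   SET SESSION sql_safe_updates = 1;
  # after which UPDATE/DELETE without a key-based WHERE (or a LIMIT) errors out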
I had a new customer who had implemented a DW in SQL Express. They had hit the 10GB limit, but fortunately they didn't need transactions older than a few months.
I cleared them up nicely, then got more ambitious and wrote some SQL to delete all transactions from all tables that were more than 3 months old.
Oops - there go the dimension tables.
Fortunately, the reload from source 'only' took a weekend.
Years ago while I was still 3rd line ops I was asked to quickly build a test server to allow a very important project to test their code. All it needed was W2K, AV and Oracle DB (can't even remember the version now). Their devs would then deploy the DB and app to the host to test that it worked as it should before rolling it out to the, completely separate production environment (those were the days when we had five test environments to stage through before hitting prod).
As virtualisation hadn't been introduced yet, I grabbed the nearest piece of spare kit and installed the required software on it, stuck it on a shelf in the DC, connected it to the network and gave them the RDP IP and admin creds. Then I went on my merry way and happily forgot all about the box for about a year.
That was when the PM in question started frantically messaging me because the "production platform was running really slowly and users were complaining". When I asked "what production platform" and his response was "the one you built last year", it dawned on me that they had seamlessly promoted the code-test box to prod without ever following process and thus were running a very critical service on a Fujitsu desktop PC with a PII-350MHz, 8 GB RAM, single PSU, single NIC and a single 5400RPM IDE drive... no backups, no redundancy or resilience.
I took some satisfaction in explaining that, as per the emails from the year before, it was a test server only, unfit for production and that he better go find some budget to buy a proper set of servers and licenses.
1GB RAM in a 486, B-S! I still have one in my attic and I remember when it was new that if you had 8MB in a desktop PC you were doing well, being as RAM was about £40/MB. Mine had 16x 1MB SIMMs, plus 2MB on the disk cache/controller and a top-of-the-range 1MB graphics card - all just so that Windows itself would run quickly, not for gaming. Cost about £2k in 1993 money, plus nearly another grand for the 17" FST monitor!
Surely the safety net to prevent these sorts of issues is the Release Management or Service Delivery leads, who should have called out this lack of proper transition and resilience before the change was accepted into PROD and then on into BAU?
I think you'd be pretty hard pushed to totally blame the PM here if that sort of control isn't in place to catch it.
That really depends on who set up the system and how their chain of command connects to processes. We've probably all been in a similar situation. For example, several years ago, I wrote a simple web application at the request of a friend who worked at a relatively big non-corporate institution that I'm choosing not to name or further characterize here. To clarify, I did not work for this institution, but they needed this functionality and it didn't take me very long. In order to test it and make changes they wanted, I spun up this application on my personal webserver. When we were done, I gave them the code and instructions on how to put it on their webserver.
I think you know what happened next, which is that they did not put it on their webserver and probably lost the code. However, I forgot to take the files off my server. I used almost no javascript or complex images so the files took up little space, and the application did not get a ton of use and was not data intensive, so I didn't see any spikes in disk or bandwidth usage. By the time I realized they'd been using my server, my friend had moved to a different place and couldn't help move it across. My contact there did not respond to my warning email. I had to make the decision of whether I'd turn it off, thus breaking their system (they were still using it heavily), send a bunch of messages to get it moved to their systems meaning I'd have to spend a lot more time on it, or take the easy way and just let them keep using my server until such time as something breaks and I don't fix it.
It's still there. They're still using it. It gets included in my standard backups, and I think it's very unlikely to go down any time soon. I really think it might be a good idea to do something else, but that's work.
That would make sense, but as I already chose not to bill them for the time I spent actually writing the app, it's a bit out of character to charge them for something that really doesn't impact me. The server is running anyway, and I'm going to keep it going for my personal website and the various other services I've put on it. Their data doesn't take much disk or bandwidth and I could easily cut it off. I'm still surprised people don't take a look at the address bar and wonder why they've suddenly left the website of the institution concerned and ended up on my site instead, but I guess that, because I have an SSL cert on my site, they just trust the padlock and go ahead and enter the information*.
*The information collected is not personal, so there are no privacy/GDPR issues here. Also, as I wrote the code, I can say with complete honesty that it collects only what is needed and periodically flushes old records from the database.
Not myself, thankfully, but the tech from our 3rd party supplier. Building a new server, but there's a lot of data (2TB) to transfer onto the file server. Simple Robocopy script to copy the data, then test, transfer users across, and finally mop up any last files that have been updated on the old server.
Needless to say this didn't go well, as the server build was royally bodged and it took two days to get it anywhere near working. The clues should have been there when he admitted he didn't know how to install SQL, but that was a different issue (thankfully the server has RAIDed SSDs; we haven't noticed the loss in performance yet from the configuration it was installed in). No, the issue was that once the server was finally declared functional we had two days of updates to grab from the old server and transfer to the new one. Simple Robocopy: grab updated files and copy across. One Robocopy /MIR later and we'd lost two days of data, and to make matters even worse he hadn't configured the backups on the new server yet, meaning those two days of work were gone. Thankfully the SQL data was on a different server.
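For the record, /MIR is the dangerous half of Robocopy: it makes the destination an exact mirror of the source, deleting anything the destination has that the source doesn't. A sketch (paths invented):

  :: the intended direction: pick up two days of updates from the old server
  robocopy \\oldserver\share \\newserver\share /MIR

  :: the same command with the paths the wrong way round mirrors a stale
  :: copy back over the updates - /MIR purges whatever the "destination"
  :: has that the "source" lacks
  robocopy \\newserver\share \\oldserver\share /MIR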
Did almost exactly the same thing myself, only on a customer computer. Luckily it was almost brand new and I didn't lose anything important. Whew!
Luckily it was a customer who was pretty laid back and appreciated having something to kid me about for the next several years.
My first IT job after graduation involved keeping data in development and production Oracle databases consistent. One fine day I needed to export data from a table in the production database into a CSV file using SQLPlus and import that data into a new table on the dev database. My first attempt imported the data into the dev database table but into the wrong columns. I truncated the dev table and tried again. Again it failed. Thinking "how hard can this be", pining for lunch and getting a little irritable and trigger-happy, I truncated the table again and exported from the production table again. To my surprise it created an empty file. As the awful truth sunk in, I still recall to this day how I felt:
- I stopped breathing;
- My heartbeat skyrocketed;
- Hot and cold flushes coursed through me;
- I thought I might faint;
- I involuntarily looked outside at my car, fighting the impulse to pack my stuff, quietly drive away and never come back.
Fortunately the production table I'd accidentally truncated only contained a handful of records of static data which a colleague had a copy of. A few quick insert statements later and all was well with no impact to the business or users. Now if I'd truncated other production tables containing millions of rows of invoicing data.....
In a new-ish job. Called out of comfy bed early on Saturday. Big client who was watching closely had a problem. Please could I log on remotely on my spiffy IBM laptop (the one with the butterfly keyboard that sprang out to full-sized like something out of Thunderbirds or a James Bond movie) and do my little rat dance and make it go faster?
Sure, no problem I'll just log in and sit on the carpet and type some simple commands and ARGH!
Ten seconds of white-out vision and some quiet screaming. Then I remembered two things: I was working on Unisys equipment (which had a crappy SQL database engine but a recovery model second to none) and the client was employing ex-Sperry people so *everything* was configured by-the-book and working like a Swiss watch (only time I ever saw an honest-to-Cray Poisson distribution on a DMS database in the wild) so I took a deep breath and typed "RECOVER DATABASE TO <ten minutes ago>" and Hey Pasta! Instant Make It Didn't Happen!
Boss called to report odd fluctuation in database activity. I told him I was doing some preliminary groundwork before implementing my Make-It-Go-Faster plan. He hung up and I did it the right way this time.
Everyone impressed. Nobody mad. Sausages for breakfast. Total Win Scenario.
60 years ago I wrote a simple LGP-30 program to execute statements like +'a'b'c' (add a to b and put the result in c). Then I discovered that the math department was using this program in a class. Some of the original BASIC restrictions (identifiers: a letter followed by an optional digit) can be traced to this program.
Cluster was designed to allow one node offline at any given moment.
I was going to shut down the top node for maintenance and was confidently plugged into it with a direct connection, gently shutting down the node's services. I couldn't be arsed moving the KVM to the other nodes so I used ssh to check the others were still running OK. The bottom node seemed to stall after the ssh attempt, so I heaved my fat backside over to move the cable. Checked everything was OK there and, after this final check, moved the cable back to node 1 and ran the shutdown cmd. Turns out the ssh attempt to node 4 had finally worked, and my lack of awareness shuttered the cluster for that morning. Thankfully my pale complexion that day earned me some forgiveness from the business.