Who in this trade hasn't felt that sinking feeling, followed by cold sweat, on realising they've done something they shouldn't?
Me, more than once. But thankfully, so far, never a clear-my-desk-and-get-my-coat one.
Hey hey hey, it's Monday! The new week is but a caffeinated beverage away. Come join us in celebrating another Register reader's flirtation with career-ending disaster with a morning dose of Who, Me? It was the late 1980s, and our contributor, fresh out of university, had inexplicably landed a job as system administrator in a …
I'd been trying to train a new guy on a (fairly) complex security system when the powers that be decided to give him full admin access.
So he pinged me for advice on how to do a particular thing, and I duly advised him on the most sensible approach - one he didn't happen to agree with.
Since he was now on his own, I simply told him that if he breaks it, he fixes it - it's the only way to learn. He has since tried to lie to me several times about having broken the damned thing, claiming something else caused the error - little knowing that I edited the logging script to create a duplicate on a remote server he doesn't have access to. I don't even need to trawl the log file; I just run a diff to see what he's been up to and then tried to cover up. Silly twat might as well just send me a report :)
He still hasn't worked out how I know. Even if he does discover my edit, he would still have to discover the slight modification to a standard cron job that replaces my code every night :D
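For anyone tempted to borrow the trick, a rough sketch of the arrangement - hostnames, paths and the cron script name are all invented for illustration:

    # appended to the existing logging script: quietly ship a second copy of the
    # log to a box he can't reach
    scp -q /var/log/secsys/audit.log watcher@shadowbox:/srv/shadow-logs/audit.$(date +%F).log

    # on the shadow box, whenever curiosity strikes: pull the live copy and diff it
    scp secsys:/var/log/secsys/audit.log /tmp/live-audit.log
    diff /srv/shadow-logs/audit.$(date +%F).log /tmp/live-audit.log

    # and the doctored cron entry that quietly reinstates the hook every night
    0 3 * * * /usr/local/sbin/reinstate-shadow-logging.sh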
It's usually only the people with the permissions and ability to do the work who end up with the problem, because they're the ones with their heads above the parapet.
It's really important for IT departments to look after people who have such genuine accidents, and there should be no consequences for them as long as they own up and alert everyone immediately, because trying to keep something quiet can make a small problem huge as time goes by.
But well done to the author of the article. Well recovered.
My incident, under pressure, was repointing an ODBC DSN in Windows. It was a while back and I repointed it OK - but there were two.
"He still hasn't worked out how I know. Even if he does discover my edit, he would still have to discover the slight modification to a standard cron job that replaces my code every night "
Are you sure he doesn't know about el Reg? If he doesn't, by the time he's worked out what you've done he'll be up to speed.
I'm pretty certain he doesn't read El Reg, so that was my thinking exactly. If he can work out what I've done and undo it without breaking anything else, then he no longer needs to be watched like a hawk.
However, his current thought processes are still more focused on preventing anyone finding out about his botch-ups so I'm expecting the heat death of the Universe to occur first. It would never occur to him to put himself in my position and then work out what *he* would do in such circumstances, because it would require him to acknowledge his own role in the play in order to view it objectively.
Oh, that last paragraph is just 'class' - beautifully constructed!
We've all had a colleague like that at one point or other. Mine was quite a while ago and the ignoramus was so far up his own chuff that everything he did was brilliant, everything anyone else did was flawed.
The empirical evidence was the reverse, and everyone apart from him knew it. I'm glad to report that he did get caught out in the end, for the same reason (syslogs and command audit trails were forwarded to a third machine overseeing the entire shooting-match).
@Sir Runcible Spoon: "... it would require him to acknowledge his own role in the play in order to view it objectively."
I have my faults, but denying my own agency isn't one of them. Those who can't see their part in the world are little better than stoats (lightly grilled and served on a bun, of course).*
*One of my favourite Douglas Adams concepts - I still remember being reduced to a giggling puddle on the floor the first (and second, and third, and....) time I read it!
The third type are those who have observed other people lose all their data without backups, and so resolve not to have that problem. And to date, I haven't. I've got all my personal stuff dating back to the 1960s. The most I've lost has been a couple hours here and there ... but always at an inopportune time, of course.
Proper backups are a vital part of properly running any computerized system. However, I can make a case for simply having multiple copies (off site is good, cloud maybe not so much) of all your important personal files being all that's needed for the average single-user/family home system. The OS can be reinstalled, your pictures and personal correspondence (etc.) cannot.
Like many people, I'm sure, I used to have "rm" aliased to do just this: move the target(s) to a "~/.undelete" directory, from which old files were cleaned after a couple of weeks by a cron job. In fact I still have this arrangement for many of my Linux and UNIX accounts, though not, I see, for Cygwin. I guess it's been so long since I accidentally deleted the wrong file that I've never gotten around to setting it up on my Windows machines. I should do that...
I had another alias ("rrm", for "real rm"), which bypassed this, for situations where moving the targets under my home directory wasn't viable. I was always very careful before using it, though. Generally I started with an "echo rrm ..." first to verify that the globbing result was what I expected; then recall the line, delete the "echo", and hit enter.
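For anyone who hasn't got around to it either, a minimal sketch of the arrangement - the function name, staging directory and retention period are all illustrative:

    # in ~/.bashrc: "rm" stages files instead of destroying them
    rm() {
        mkdir -p ~/.undelete
        local f
        for f in "$@"; do
            [ "${f#-}" = "$f" ] || continue   # skip rm-style flags; this is only a staging shim
            mv -- "$f" ~/.undelete/
        done
    }
    alias rrm='/bin/rm'                       # the real thing, for when staging isn't viable

    # crontab entry: purge anything that has sat in the staging area for a couple of weeks
    0 4 * * * find "$HOME/.undelete" -mindepth 1 -mtime +14 -exec /bin/rm -rf -- {} +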
Only once for me... Build task on a networked file system, unset environment variable in a different script than the one I was editing.
There was a "rm -rf $PATHVAR/*" line. That went... as well as you might expect. It only wiped out half of the file system before I stopped it...
Luckily, backups were taken of the filesystem, and were restored, and the offending line is now protected by guard statements, so it won't happen again. Unfortunately, said script, for which the only modifications I have made are the guard statements themselves, is now referred to as the "<Baldrickk> script" - sigh.
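Guard statements of roughly this shape do the job - variable name as above, the rest illustrative:

    set -u                                    # abort on any reference to an unset variable
    : "${PATHVAR:?PATHVAR is not set - refusing to delete anything}"
    case $PATHVAR in
        ''|/) echo "PATHVAR looks dangerous: '$PATHVAR'" >&2; exit 1 ;;
    esac
    rm -rf -- "${PATHVAR:?}"/*                # the :? expansion is a second belt-and-braces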
I'm reminded of an occasion at work when we did an out-of-hours upgrade to our CRM system several years ago. Some unfortunate engineer forgot that the out-of-the-box setting for the payment platform on the Windows component was to point at the test instance, which doesn't actually take any payments. This needs to be manually edited to point to the live instance when the software package is installed or upgraded. This is usually done, but is forgotten from time to time. This particular occasion wasn't the first time it had been missed, but it was the one that bit us in the bum rather hard.
Unfortunately, this oversight wasn't discovered for a few days (may have been a couple of weeks, I can't remember exactly), until it was noticed a handful of live customers weren't being billed.
The practical upshot of this is I modified the start script of the main Linux package to query the config file on the Windows box and refuse to start if the setting was wrong. This has saved our collective arses a number of times since.
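A check along these lines does the trick - fetch_windows_config here is a made-up stand-in for however the Windows box's config actually gets read, and the endpoint name is invented:

    # near the top of the Linux package's start script
    ENDPOINT=$(fetch_windows_config payment.ini | awk -F= '/^endpoint/ {print $2}')
    if [ "$ENDPOINT" != "live.payments.example.com" ]; then
        echo "Payment platform still points at '$ENDPOINT', not the live instance - refusing to start" >&2
        exit 1
    fi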
My favourite was when a little-known OS bug silently trashed the encryption headers of every single disk in a very critical production file server. The worst part is that the machine kept churning away for some weeks afterward, and the bug was found during a routine early morning restart.
I was the one that executed the routine restart. Then found out none of the unlock keys would work...
Very sweaty palms, and an indescribable sinking feeling as some of my own rather critical work was on the machine. About 60% of the data was recovered from backup, the rest was just gone. Backup schedules to removable media were increased in frequency and disk to disk backup in the same server was banned. To this day the org in question doesn't consider anything actually backed up until it's been written to tape, reread with matching checksums, and sent off site. Anything with a spindle is verboten for use in backing anything up.
Took a long time to regain any trust in said OS after that fiasco. Icon for status of data. Still gives me a twinge of something not entirely unlike PTSD thinking about it.
Simply place the backup tape next to the drive array, and wait for some kind of data-osmosis to occur.
Of course, verifying the backup will be difficult, but I'm sure it wouldn't be the first time that backup integrity checks have basically consisted of crossing your fingers and hoping really hard.
Simply banning "something with a spindle" from being used in backups isn't a good practice. It's just a measure out of fear and misunderstanding.
I know, we get traumatised and scarred from past mistakes and disasters. Been there, done (and suffered) that.
Understanding each backup medium and realising its strengths and limitations and planning accordingly is much better!
I don't think this applies in your case, since you've already described following many other good practices; however, once someone starts putting too much faith - without constantly reasoning about the "why" - in some well-established procedure, disaster follows.
Yes, that's true (and I had condensed the ban down to a simple phrase for comedic effect), but each time active media (spinning rust drives, SSDs, even early R/W optical systems) have been evaluated they've come up fairly short on a number of key characteristics.
The retention criterion is fairly simple: if you assume only one copy of the data has survived a disaster, potentially after sitting on a shelf for the past decade (retention of static data; the data may not be continually rewritten), do you want to hope that all of the delicate (static, EMP, power stability and corrosion sensitive) electronics integral to accessing that data have also remained intact? Or would you rather have a DR plan that basically states "if the drive is bad, use a spare"?
Sure, you can compensate by duplicating the active media in multiple locations, but then your costs start to spiral uncontrollably compared to good old fashioned cold magnetic/optical storage in a secure off-site location or two.
I've had stacks of SSDs just up and die over the years. Many of them would "work" until powered off for an extended period -- powering them up reveals no data. Same with hard drives -- if the platters didn't outright stick, the electronics tended to be unreliable.
Thankfully (and thanks largely to the number of fuckups I've had to deal with from others), I've not actually done really bad stuff to anyone but myself. We have shared computers, and had a problem with users logging onto them, then pissing off for hours, leaving the computer unavailable to others. My boss wanted a solution - preferably one that cost as little as possible. I wrote a small screensaver that, when it was activated, did a forced shutdown and restart. Then, while testing it, I realised I hadn't saved anything. So, I lost several hours' work.
For the second, I need to explain something. Because of some partition format problems we had with our install of Windows, we had a bootable disk that, when run, wiped the partition sector of the internal hard drive. I booted my work PC up one day, not realising that not only was this floppy in the floppy drive, but my machine was (rather unusually) set to boot from drive A. I realised as I lost several years' work, not all of which was backed up.
As soon as I noticed rm -r * in the text I could see what was coming!
I've narrowly avoided that one, but back in the late 80s I was working for an Apricot dealer supporting a mixture of IBMs, clones and Apricots. The problem was the latter had the HD as drive A not drive C. A couple of times I went to do a high level format of a floppy and started formatting the HD instead.
Fortunately it was easy to spot and stop. Norton Undelete was effective provided you knew the first letters of the filenames, and as the software installed was our own, I could figure them out.
Same here - saw the rm and knew what was coming.
I have a non-critical but annoying problem on my Mac where an app crashes occasionally due to a problem with directory access. The Dev isn't really interested in fixing it, so after putting up with it for a while I decided to try to get to the bottom of it. It relates to log files which the app periodically clears out with an rm. My first thought was to try to change the location of the log files so there wouldn't be an access problem. My second thought was "never f**k with an rm command", and I left it alone. Better the odd bit of annoyance than lots of unexpectedly freed disc space.
I usually start to type rm -r /blah/blah
, then realise what I'm doing and put a 'z' at the start, so it reads zrm -r ...
, so even if I accidentally hit enter, no harm will befall me. Hopefully.
Of course, the other day I ran an rsync with the --dryrun flag. Those paying attention will notice that it really should have been --dry-run. Fortunately it gave me a syntax error instead of running.
Try echo: echo rm -rf /${hope_this_variable_is_set}
Btw. I have done more damage using mv. Move several files into directory blah: mv foo blah, mv bar blah, etc. And then I realize blah is NOT a directory. So I repeatedly overwrote the file(!) blah. Well, you still have the last file... Me ------>
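A couple of GNU mv options make that particular mistake harder, for what it's worth:

    mv -t blah foo bar    # -t insists blah is a directory, and fails loudly if it isn't
    mv -n foo blah        # -n refuses to silently overwrite an existing file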
I had an Xi, now that was a great little machine.
10MB hard drive, GUI, C interpreter (!!), C compiler, BASIC compiler, dBase, Multiplan, WordStar and still 5MB of space for data.
The A:/C: thing nearly caught me out a couple of times, but it never quite did. On the other hand, Norton Undelete was a tool no serious PC user was without back then. That and Xtree.
My favourite is Double Commander which, while rather "rough and ready" compared to some of the other NC descendants, has at least the advantage of being the same across Windows, Mac and Linux - which, when you have to work across multiple operating systems every day, saves some wear and tear on the grey matter.
"Norton Undelete was a tool no serious PC user was without back then"
So true! I never went anywhere without my complete set of Norton Utils back in the day.
Happy to say those very same disks are still in the same box and have been resting for many a year. I daren't throw them out - you never know....
I have been the wielder of the Norton disc in the labs I've worked in and at home. Called in to undelete things after people unwittingly held down the shift key to select things to delete, which included the item they had selected before. All got dragged to the Trash, which was then told to Empty.
I got my users pretty well trained to do NOTHING until Undelete could get working. I had the disc because expecting IT to help in a timely manner and on a Mac was a pipe dream.
I'm still not entirely sure how I ended up in the role. Being interested, I think, and being bothered enough to get informed. I also have a problem-solving brain which likes solving problems; other people's? Fine.
You have been warned. I have learned that sometimes people don't want their problems sorted. Their problems are their crutch and their excuse.
"You have been warned. I have learned that sometimes people don't want their problems sorted. Their problems are their crutch and their excuse."
This is a true pearl of wisdom and deserves many upvotes.
I'm embarrassed to say it took me many years to learn this about my useless, idiotic, in-laws. All those wasted years and effort on trying to help them sort out their problems, only to wonder why they would undermine my efforts the moment my back was turned, or simply develop new problems with which to fuck up their life.
Trouble is, one of them's now dead, the other is in a home with dementia, and I'm *still* sorting out their shit. Still, their ability to create new problems is now limited.
Sorry to hear that - it is a very uncomfortable feeling. The odd thing is, the sibling that complained will be really surprised that you don't act towards them as you used to, and get upset when you refuse invitations to family gatherings, etc.
They'll also scream like hell when you say, "Fine, you deal with the finances from now."
/bitter experience mode
"They'll also scream like hell when you say, "Fine, you deal with the finances from now." "
That has seriously crossed my mind. I've spent countless unpaid hours on all the admin, not to mention cleaning and fixing up their house so it can be sold to pay health care costs etc. and all the sibling seems to do is create situations that can be twisted to suit her antagonistic viewpoint - which she then uses in a smear campaign.
Thankfully we saw this coming and have already appointed a solicitor to help us out.
Amazing that the default for rm wasn't to query when in interactive mode, and you are doing recursive/force or are root in a 'sensitive directory'.
"Are you sure you wish to recursively remove all files in / as root? Y/N"
But then people would run 'rm -rf --do-not-ask *' or something instead.
rm does, now, protect you from recursive deletes of your root FS. You have to pass it "--no-preserve-root" in order for total destruction.
https://www.gnu.org/software/coreutils/manual/html_node/Treating-_002f-specially.html
I wonder how many stressed-sysadmin-hours that feature has saved :-)
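For completeness, roughly how it looks on a GNU coreutils box, plus a related belt-and-braces alias:

    rm -rf /              # refused by default; it takes --no-preserve-root to proceed
    alias rm='rm -I'      # -I prompts once before removing more than three files, or recursing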
It won't stop you running rm * in /usr/bin
Thankfully it will still let you run most of the OS commands you have just deleted...
... Giving you a chance to retrieve most of them via FTP (symbolic links don't work well) from its sister server on another site until you can do a restore from tape.
Good test to show your backups for the backup server work
Without intending to anger the gods and invoke Murphy's law, I've never fallen victim to the 'rm -rf' 'accident'. Probably because of tales like this.
Plenty of other mistakes.
In one case (not my mistake although I ended up fixing it), half a BIND master zone file got removed through a vi accident; unfortunately the result was valid and half the names disappeared! Fortunately I knew the name and address of the backup server, so we were able to rollback that single file. I've shown a strange obsession with filesystem snapshots ever since :)
Same here: the horror stories around rm have made me very wary.
On the other hand, I once gaily installed an experimental data compression NLM on my NetWare server, as the 20MB full height Seagate was filling up, and I couldn't afford the squillions it would have cost to buy a bigger one.
Guess what? It was very good at compression; decompression, not at all. Lost everything before 1991. I still have an ARC file somewhere that nothing has ever been able to decompress...
Well, I haven't yet done an rm that killed my own files, although I can take the blame for someone else running an rm that lost them their files. When at university, I was helping a younger student in the second programming course who was getting disk quota errors. The reason was that their code was not working very well and had been dumping a lot of cores, which had not been deleted. We used a couple tools that produced different core filenames, so "rm core.*" wasn't enough. So, of course, I spoke the required command for the user: "rm, then a space, then asterisk core dot asterisk". Unfortunately, another space got entered, and not in a good place. And now I no longer read code or commands aloud.
For the record, I had some extra access to things and I was able to get the student a relatively recent copy of their work. I'm not sure how they felt about me after all was said and done, but as this was the due date for the assignment, I believe there was much panic from everyone.
Well I'd just like to say that recent patching on some Windows 2016 DC Hyper-V hosts has left them utterly unusable. We have some servers that decided either not to see the SAN after patching, or just not to play nicely in the cluster(s). Even better, uninstalling the most recent patches has left them in a boot loop.
Now I have to go onsite to the datacentres to rebuild them from cold because they're too stupid to acknowledge that the internal USB ports aren't actually removable drives. Luckily I need to do some storage work imminently so I'll roll the whole lot together, but I've burned most of a week already on this pish.
I think it's time to petition for new servers.
I think it's time to petition for new servers.
Non-Windows, hopefully...
I also don't like to update my hosts. They're running without issues, so the only thing that gets updated is Windows Defender.
And before I get lambasted for not applying security updates: I don't trust Microsoft and their gung-ho approach to Windows updates either - see what's happened to Suck10.
I find it amazing that installation instructions (from software companies that should know better) start with "go into su mode.... and issue the following instruction". The developers have their own sandbox machine ("just spin up a cloud instance for the test so if you break it, it doesn't matter") and have clearly never been near a production environment.
At the time of the story, being able to "spin up a cloud instance" was decades away. I guess back then each Dev would have had multiple physical machines to work on.
I agree that the current vogue for a Dev spinning up a VM somewhere and then assuming a "clean build" and/or making big changes to libraries config etc. is a serious PITA!
This whole "cloud" thing is just another name for service bureaus renting out centralized computing.
I'm waiting for someone to reinvent a PC (no, not a phone) to be followed by on-prem servers to be followed yet again by yet another name for bureau service.
What goes around... I guess we will have to wait for the realisation of the true cost of using someone else's computer to outweigh the trendiness of using the current hip thing. Perhaps when the whole edifice collapses when a hapless fool innocently changes something which is replicated endlessly around the globe by aggressive dependency policies it might be cause for thought. Or perhaps not.
Perhaps when the whole edifice collapses when a hapless fool innocently changes something which is replicated endlessly around the globe by aggressive dependency policies it might be cause for thought
And that gives me a possible solution for a problem in a SF story of mine, how a particular attack vector hits computers worldwide when originally meant for localized targets.
Not the 0.5 Internet. That would have been a pre OSI model IMP version (See 1976's BBN Report 1822), which most agree culminated in Internet 1.0 and included the use of TIPs ... The Morris Worm ran on the later TCP/IP Internet which went "live" on January 1, 1983. This can be considered Internet 2.0, and was fairly mature by 1988.
As noted in "Zen and the Art of Motorcycle Maintenance", instructions are generally written by the least valuable member of the team, one who can easily be spared from regular work, and indeed one whose absence will improve the progress and quality of said work.
"Stir in the fact that changing a 1Gb drive back in the day was a two or three-person job"
Nah. By the time the SPARCs arrived, that was an easy one person job. The first SPARCs were Sun-4 models, with full height CDC Wren 5.25 SCSI2 drives. Not lightweight by modern standards, but hardly the 8 inch monsters that CDC had made a couple years earlier. Some early machines even had 3.5 inch half height drives. I can't remember if they made it past Pilot build and Alpha test prior to Sun actually labeling the machines SPARC though ... lot of water under the ol' bridge.
"Don't forget the 1G IPI drives."
How many of those did Sun sell with SPARC systems? Maybe a dozen IPI drives total?
Also, for a lot of the Sun SPARC product line, the "hard drive" was really just a steel chassis that a power supply and one or more HDDs fit into. For example, my own pre-SPARC 3/470 "Pegasus" is what they called a "dual pedestal deskside system" ... One box holds the VMEbus+cards and its power supply, a full height 5.25 CDC Wren drive, a floppy drive, and a tape drive. The other, connected via SCSI cables, contains four more CDC Wren drives and a couple power supplies. The individual drives are easy for one person to swap out, as are the (hot pluggable!) redundant power supplies ... but the entire contraption takes a couple people to lift. The 19" rack mount systems were built the same way. If they were mounted properly, you could swap out dead drives single handedly.
If they were mounted properly, you could swap out dead drives single handedly.
Unlike the 7914 drives for my HP3000. Lifting one up while trying to line it up with the pins on the sliding rails is not a single-person job. Well, it can be done, I've done it.
It would be amusing to see the look on an H&S person's face if that was being done in today's corporate environment.
Those 7900 drives were early '80s, while SPARC was late '80s ... amazing how much shrinkage storage experienced in those few years, no?
I wonder how far behind we'd be if today's elfin safety nazis had come into existence in that decade. Probably still stuck with the VLB bus and half height 5.25 drives ... with silicon being built on 6" (~150mm to you euro-types) cookies.
Yes they were. I have a 7970E and two 7914Rs in the rack. Just found a copy of the install manual...
The disc drive weighs approximately 67kg (148lb); more than one person may be required to install it in the subsystem cabinet
May? Well I guess they were right. I have no idea how I managed to wrangle the drive into the cabinet, let alone align it with the pins in the rails.
If I recall correctly the drive is 132MB or so. Hmm, wonder if the 3K still works, or more to the point if the tapes are still bootable. Inrush current for the storage rack is quite impressive. Always flickered the lights. Once spun up, it was safe to turn on the 3K itself. Joys of single phase in a home environment.
Yes, and it didn't take long for storage to shrink to half height and then down to 3.5".
I reckon you're right that we'd be nowhere close to where we are now.
It's not a directory and it wasn't me, but... I remember well sitting at my desk eating my morning porridge and imbibing caffeine when a systems bod complains his web management interface is down. After a quick dig around it looks like internal DNS is down. Or at least it's not resolving.
Check the DNS box, and there's no root entry. In order to tidy up DNS and make it standards-compliant, a new tech had removed it - after getting the OK from the change management group.
ITIL, Schmitil.
Anon to protect the not so innocent.
The piggy bank of favours was cracked open: Scott and the engineer had a pretty good working relationship stemming from Scott's willingness to overlook an occasional miss of the contracted call-out times
Reminds me of the time one of our lab Macs needed to have its OS upgraded by a field engineer so they could install the latest instrument software; the hard disc in the Mac then chose a seriously bad time to die! The scientists were furious but luckily I was around to point out that it was very bad luck on the Field Engineer's part and it could have easily been me. I managed to find a new spare compatible disc from our spares cupboard and saved the poor guy's bacon.
I really should have got a few of these from him... -->
I can't recall the exact details (I'm either getting old or my brain wilfully suppressed it), but I recall having the odd problem with the Wyse terminal I carried to work on the pizza-box Sun servers we mainly used for building firewalls. One particular instance involved driving across the UK for six hours or so and then returning the same day, only to hear the day after that it didn't work. If I recall correctly, switching off the terminal before disconnecting it would send a STOP signal to the box, so it would basically sit there suspended until someone hooked up again and told it to continue.
I can't for the life of me imagine why that was implemented, but it taught me never to walk offsite without testing a client machine's access to the Net.
I can't for the life of me imagine why that was implemented,
At the time it was pretty commonplace for a <BREAK> signal from the console terminal to perform the sort of non-maskable interrupt that we associate with Ctrl-Alt-Del today. It was the "stop everything and give me back control" command, useful if the server was hard hung. <BREAK> was sent by having the RS232 transmit line held at 0 (low) for a longish period (IIRC it was between 1 and 2 character times).
The problem was that many terminals would stop sending data and take the transmit line to a low value for a time when powered off, and the server saw that as a <BREAK> signal. I think that there was eventually a patch for Sun systems so that you could disable the <BREAK> response on the console.
Ahh, bloody DECservers had a nasty habit of doing just that on reboot.
A bit of a ball-ache when used as terminal servers for a rack of U60s which would immediately lock up unless you remembered to disconnect them first. Of course then you had the problem of remembering to reconnect them all again afterwards, or be faced with a dead console when you really Really REALLY needed one !
I have a few:
1) Developer managed to do an rm -rf * from root, just not *as* root. Still causes a lot of damage. After we recovered the machine he very helpfully showed us what he had typed - and then ran the damned command again!
2) Took a call from someone in the US. Their database was down. I asked them to go to a directory and found it was empty. After a few minutes of questioning, I finally managed to get to the bottom of the issue. Apparently the filesystem was full, and the person had done an rm -rf on the directory structure that held the database! To make matters worse, the last backup was over a week old.
3) Had a developer decide to delete glibc on a Linux machine so they could install a new one. Hint. Linux really needs it!
4) Had someone do a chown -R from root as root. That was an amusing one to try and resolve.
And my personal cock-up.....
Problems with a disk under VMWare. I deleted disk 5 instead of scsi id 5. In my defence, it was on a cluster with shared disks and at the end of a very long and stressful day :)
Been there with a dev doing something strange to glibc libs. Fortunately not a complete disaster, but a bit annoying to have to sort out. At least another time when a dev suggested they wanted an updated glibc (and helpfully pointed to a tarball available online) another dev pointed out that wouldn't be happening for several reasons.
"What do you mean you don't like dependencies on nightlies?"
Cue application that has never run stable on anything but the developer's computer. (The developer who refused to ever really try the application, or sit down with people who did, and insisted on reproducible bug reports before doing anything about crashes.)
Anon, just in case.
Disk work is ... entertaining.
I once had a mirrored volume set up with the intention that one half would be in one data centre ("A") and the other half in the other data centre ("B"). Worked fine except for the one mirror where I'd managed to set up both LUNs in the same data centre; and to maximise the stupidity, both were in the other data centre ("B") with the server in "A".
This isn't a new story as such, but on topic. It also wasn't me what did it, but a colleague, honest.
Back in the day, working for a large ISP in the UK that still ran the *.co.uk name servers. Said colleague was adding a new customer domain to the file using vi. Since the customer domain began with an 'n' he was about half-way down in the zone file. All seemed well after saving and exiting the file, but reports started to (at first) trickle in that some domains were unavailable in the DNS.
The trickle turned to a deluge, and it seems that my colleague managed to 'delete to end of file, save' the zone file.
Obviously we restored from a backup, but it still took over 4 hours for all the domains to trickle down through the secondary servers to update.
The existence of DNS is a very good reason for not passing root access about willy-nilly. Not even when somewhat sanitized with su ... It's absolutely astonishing how many people think they know better than the admin who set it up, and thus can improve a system that has been running flawlessly for a year or more.
I wrote a simple screen editor for MS-DOS 0.96 in EDLIN, creating a bunch of text files full of pseudo-assembler commands that, when concatenated together and redirected into DEBUG, produced the editor as a .COM file. No need for linking with .COM files. Why? Curiosity, of course. I was learning the internals of a new OS program loader.
Primitive? Absolutely! But try to remember that DOS was tiny ... It ran from 160K floppies. Most early machines didn't have hard-drives, and if they did they were probably only 5 megs. DOS was mostly useless as a program loader, until ver. 3.1 enabled the networking hooks ... But it was a hell of a lot better than dragging card-decks to the glass house and waiting days for the result!
As a side-note, I had already been using UNIX for several years (BSD on DEC, mostly) when the IBM-PC came out. We looked at each other & asked "What is IBM thinking? Thank gawd/ess it can't do networking!" ... the rest, as they say, is history.
From the way-back machine, I had a Commodore 64 and as a teenager had no money for the $35 macro assembler cartridge. So instead, I wrote a very simple one in C64 BASIC that supported JMP labels and such. But since the C64 didn't have a built-in editor, I also had to write one in BASIC. And it was based on EDLIN.
That calls for an addDomain.sh script that requires no manual editing of the actual domain file (obviously these days the domain file would be generated from a database of domains and blah blah blah).
It's amazing how poor most practices were in the past (and still now, unfortunately). Critical files simply should not be hand-edited, and indeed should have machine validation prior to deployment.
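Something along these lines, perhaps - zone name, record and paths invented, the point being the machine-validation step before anything goes live:

    #!/bin/sh
    set -eu
    domain=$1
    zonefile=/etc/bind/zones/db.co.uk
    cp "$zonefile" "$zonefile.new"
    printf '%s\tIN\tNS\tns1.example.net.\n' "$domain" >> "$zonefile.new"   # SOA serial bump omitted for brevity
    named-checkzone co.uk "$zonefile.new"   # a broken zone stops the script right here
    mv "$zonefile.new" "$zonefile"
    rndc reload co.uk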
My 1976 ADM-3A has the green phosphor option. There was also a white phosphor option.
My LSI ADM-3A has white phosphor. And uppercase only (which makes me doubt if it was 3 and not 3A) Luckily it at least has 24 lines instead of 12. It also has CTRL in the more convenient for unix location. Never did come across one with the Tektronix 4014 option.
call me new, but I use different colored terminal windows...
I also had a lucky recovery from doing The Wrong Bloody Thing on the wrong RDP session.
Nowadays every server I remote in to has a tiled bitmap set displaying the server name and also which site it is. Helps a lot if you have got a lot of RDP sessions open.
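For plain shell sessions, even a per-host prompt colour helps - something like this in ~/.bashrc on each box (the hostname pattern is obviously site-specific):

    case $(hostname) in
        *prod*) PS1='\[\e[41;97m\][\u@\h \W]\$\[\e[0m\] ' ;;   # white on red: you are on production
        *)      PS1='[\u@\h \W]\$ ' ;;
    esac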
Ah, HP-UX. My fondest memories of HP-UX stem from when I was on a project in Singapore and we had to interface with Windows. We also had a Red Hat box in what we were building, which gave me the idea to see if there was a HP-UX version of Samba as a possible solution.
After some digging I found an HP-authorised CIFS variant, but our office at Orchard Road didn't have enough bandwidth (one line divided over tens of developers tends to saturate quickly), so I had to get a cab back to my apartment where this new-fangled invention called WiFi (then delivered over a card I had to insert into my laptop) gave me about 10x the speed I had in the office.
Those were interesting days :)
I remember a colleague who accidentally deleted the on-disk kernel image file on a running Solaris box. Didn't have any immediate effect, the system still had the file open for paging even if the directory entry was gone, but he had a tense few moments hunting round the systems on the network for one with exactly the same OS version. He then FTPed it back to the boot directory, and after a reboot at a convenient time he heaved a sigh of relief when it rebooted OK.
Wonder if there's some way to re-create the directory entry in that situation, the data's still on disk. Maybe could be achieved by hard-linking the process's file handle? (I'll admit the copy-known-good-version approach sounds less risky.)
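On Linux at least (Solaris of that vintage may well differ), the usual trick for a deleted-but-still-open file is to copy it back out through /proc rather than trying to relink it - hard-linking the /proc fd back into place generally doesn't work. The PID and fd number below are made up:

    lsof +L1 | grep vmunix                  # find the process still holding the deleted file open
    cp /proc/4242/fd/7 /vmunix.recovered    # copy the still-open contents back to disk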
A friend somehow managed to delete /vmunix on a largish Sun system back in 1988 ... fortunately on a Saturday afternoon. Also fortunately, he had enough sense to call me (and tell me the truth!) before he started "fixing it". All was well come Monday morning, and I had free beer for the rest of the month.
Disk clone...
Back in the days I was stationed at a toll plaza. Things were fairly quiet most of the time, and I was playing around with OS/2 Warp v4 (yes, long ago).
Then a dev asked me to clone one HDD of a lane over and configure it for another lane (the clone).
They were IDE drives: set one as master and one as slave (I was never a fan of cable select) and run Norton Ghost. I also made sure the correct HDD (the original lane HDD) was master and the drive with OS/2 on it was the slave.
A quick <tappity><tap> and away goes Norton Ghost and clones the HDD.
Removed the master HDD, set the slave to master, reboot... and up comes OS/2
Suffice to say I made another clone from another lane; that one was a success, and I fixed my own boo-boo.
Lesson learnt. If possible, use different-sized HDDs.
Also, with Clonezilla you can identify the HDD a bit better as it gives you a more verbose description of the HDD you want to clone to/from.
I miss those gay, carefree days without spam, cryptomalware and shouty bosses/clients.
A few years ago I was in charge of keeping an eye on an ELK stack which was centralizing logs across a few dozen services. The thing was a bloody waste of processing power and network packets most of the time, but I still had to keep it running because the logs were used to draw shiny graphs for the PHBs. We only kept 7 days' worth of logs, with a cron job cleaning up every night. Then one day I had a request to keep the logs of a certain date on hand for one of the devs, so I stopped the cron job. A few days later I had to clean up the extra logs, so I connected to elasticsearch, the storage layer, and ran a delete command for the first time. When it took way too long and errors started to appear on the dashboard, I knew I'd made a mistake. I had deleted everything in elasticsearch, since I hadn't fully read the documentation (I would have learned that the delete command is not search-then-delete, it just deletes everything, for example). So I fessed up to my boss, warned the devs we were having a few teething issues that day, and did a rollback from the previous day's backup (I was happy that day that the automatic backups on AWS were enabled). In the end we didn't lose much, just a few logs that weren't that important anyway.
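For anyone else who skims documentation, the distinction that matters is roughly this (index names invented, and the API details vary with the Elasticsearch version):

    # deletes the matching indices outright - everything in them is gone
    curl -XDELETE 'http://localhost:9200/logstash-*'

    # deletes only the documents matching a query - what a scoped clean-up should use
    curl -XPOST 'http://localhost:9200/logstash-*/_delete_by_query' \
         -H 'Content-Type: application/json' \
         -d '{"query": {"range": {"@timestamp": {"lt": "now-7d"}}}}'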
Haha, many moons ago I was planning a system upgrade. Being the diligent type that I was, I restored a fully functional replica of the live system in the lab from backup, went back to my desk and added a static route on my work laptop pointing back to the lab environment rather than the real environment. As I only expected this to take an hour or so, I didn't bother to make the route persistent. Cue a Windows update while I popped out for a crafty smoke break, followed by me uninstalling the software package from the live devices instead of the lab devices.
I look back with some fondness now given that it was 15 years ago, it was a very different feeling at the time!
On one occasion a $software_vendor was in to do an upgrade. Upgrade successful… then clean-up with rm -rf * as root in /
It was a very clever system that replicated data very fast (for the day) to a partner machine for failover in case of failure. Sadly the deletes were replicated just as fast.
Okay, so the good news was that we had backups of the data and we knew they were good because they were tested. But the OS needed a bare-metal install first. The OS was SCO Unix. It came on floppy disks. Boot from the first and see the message…
"Please insert disk 2 of 96"
It was a long night.
PDP11/44 at uni in the early 1980s which (as a PhD student) I somehow ended up (sort of) running because nobody else would. Users who had to access image capture hardware needed superuser rights (or whatever this was called, it was a *long* time ago). Some users had multiple IDs. One user -- no, definitely not me -- who had hogged too much disk space (2 x 20MB drives shared between a dozen or so users!) decided to save his data to tape and clear up his disk areas. DEL [*,*]*.*;* deleted all files for all users (not just him), including the OS (RSX-11?) -- or at least, it deleted all the file allocation tables; the data was still there, but of course no OS commands worked any more since the commands ran from disk. And there were no proper backups: tape drives were mainly used for data storage, and people were supposed to save programs on 8" floppies but rarely did. Months of work for multiple postgrads circled the digital drain...
Luckily the system debugger just happened to be loaded (in 64k of RAM!), and could talk to the disk and printer, and could print out (to the line printer) the absolute block address, owner and filename for each disk block. Said idiot user had to sit there and manually reallocate all blocks by hand to rebuild the FATs, took him most of a weekend, sweating all the time because if anything happened or the debugger crashed there was no way back except reformatting the disks, reinstalling the OS, and losing all the data.
I had someone do this on a box I was admining (not me this time).
Luckily I had a similar un-wrecked system. I captured the correct perms from the unborked machine and wrote a script to re-permission the broken ones; I had to boot off tape first, as the setuid perm had been removed from /bin/login and so you couldn't actually log in.
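A script of roughly that shape might look like this, assuming GNU find is available on both boxes (the file name is illustrative):

    # on the healthy machine: record owner, group and full octal mode for every path
    find / -xdev -printf '%p\t%U\t%G\t%m\n' > perms.txt

    # on the broken machine: replay the list (bash)
    while IFS=$'\t' read -r path uid gid mode; do
        chown "$uid:$gid" "$path" 2>/dev/null
        chmod "$mode" "$path" 2>/dev/null     # the octal mode carries the setuid bit /bin/login needs
    done < perms.txt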
Not as severe as "rm -rf", but I rebooted the production server thinking it was the test server.
I was at the test server console. But I was reusing a terminal window that I didn't realize was SSH'd to the prod server. First inkling I had a problem is when the test server didn't shut down its display. SSH closing was the second. The dismayed phone calls from the prod server's user community was the last.
At least it was a reboot.
Accidentally shutting down the home directory server.
Good news, have access to server room. Bad news, that's irrelevant as it's a VM and you don't have hypervisor access. Fortunately the person who did, although on holiday, a. was checking email, b. had five minutes to connect and press start on it.
(Other good news, things are set up so this doesn't cause data loss, although it does cause a certain amount of thumb twiddling.)
After doing two years of support at a VAR in the '90s, and spending many a Friday afternoon on the phone with some hapless department manager who thought they would just clean up the chaff and inadvertently hit enter with root@doomedserver:/# rm -rf * on the line of the production server, I've always managed a pwd before running the poison rm as root. Now, if you want to talk about borking remote networks, that I have stories about... I don't always remember to do a "restart in 5". Luckily, never at a site that required a plane or helicopter ride to get to.
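"Restart in 5" being the dead-man's handle before prodding remote network config - the Linux flavour looks something like this (Cisco kit has its own "reload in 5" / "reload cancel" equivalent):

    sudo shutdown -r +5 'auto-reboot in 5 minutes unless cancelled'   # schedule the escape hatch
    # ...make the risky change, confirm you can still get back in...
    sudo shutdown -c                                                  # still connected? cancel the reboot

(Only helps, of course, if the risky change hasn't been written to the startup config.)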
There was once a Solaris bug which, when running
"zpool create" on an existing zpool would wipe it and start fresh.
And I was building two large clusters with a lot of zpools.
And I'd built dev and was in the process of building prod, with the same script and same names.
And I got asked to make a small change in dev just as I was about to run the script on prod.
And you can guess the rest.
But I did fess up, Sun (as they were then) admitted it was a bug as well as my own stupid fault, and I got to stick around to fix it, which took a few days.
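A guard of roughly this shape would have caught it - pool and device names invented:

    if zpool list "$POOL" >/dev/null 2>&1; then
        echo "zpool '$POOL' already exists on this host - refusing to recreate it" >&2
        exit 1
    fi
    zpool create "$POOL" mirror c0t0d0 c0t1d0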
Two node Oracle RAC Solaris Cluster. Indirectly serving tens of thousands of users.
Junior admin decides he needs to reboot one of the nodes (during the day, it was that kind of place).
What he didn't realise was the quorum disk was offline (that kind of place).
So, before the reboot, the cluster has 2/3 votes and therefore quorum.
As soon as one node goes offline, the cluster has 1/3 votes, and therefore no quorum.
What happens next is obvious to anyone who knows clustering: what's termed a "split brain", or at least the mechanism to protect against one.
Remaining node "takes one for the team" and shoots itself in the head.
Database offline, many many many people disadvantaged and a specific procedure needing to be followed to bring it back online.
A while back I performed an in-place upgrade on a client's information management system. They were a not-for-profit and had invested little in IT; consequently the "server" was a random desktop host A with an at-capacity HDD, and the file store itself was physically stored on another hard drive in host B, accessed via a UNC path in the software. Host B's HDD was also close to full.
Upon completing the upgrade, I wrote up a report along with nice drawings showing their system architecture, and a strongly worded caution that the file store was NOT on the "server" and to make sure host B was always switched on and DO NOT TOUCH THE FILE STORE FOLDER.
Whilst on call over xmas some time later, I received a panicked phone call from the client. He'd run out of space on the server, needed to store some things, and had found another computer with a big hard drive he decided to UNC-map to... once he'd deleted some "file store" folder which obviously wasn't important, so there'd be enough space to use. You can imagine why he'd called our emergency number.
I gave him the bad news and suggested he go to his backups, after which I would assist in fixing the app side. The good news was, he DID have backups, it was xmas and nobody was there except him so there'd been no changes in several days, and he managed to recover everything. I then got to work fixing a bunch of in-app links - scriptable - and went back to watching the cricket while thinking about my on-call pay ticking over as the script ran.
I don't know if he was ever required to explain the bill my company would have charged (or, more likely, a big chunk of time deleted from our balance of paid-for-in-advance support hours).
I have an unpleasant Pavlovian response to wildcards or variables on the same line as an rm.
Now that I am old, I always do a recursive rm in two stages. Move the files to be removed into a deletion directory with a nice clear, unfancy name (no spaces etc.), then delete that single directory, completely explicitly, using tab to complete the name.
It sounded great right up until the tab completion. More than once I've tried to tab complete a key command (stopping just in time) and not noticed the tab didn't actually complete.
So, for instance (greatly simplified, this tends to be a problem in cluttered directories versus simple example ones):
$ ls -R
delete:
rubbish
morerubbish
demonstration:
cooldemo
de:
source.c
key.dat
$ rm -rf de [TAB]
oops....
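One way to keep the two-stage idea but take tab completion out of the picture is a tiny helper - the name and staging directory here are entirely illustrative:

    doom() {
        local staging=".doomed.$$"            # unlikely to collide with anything completable
        mkdir "$staging" && mv -- "$@" "$staging"/ || return
        ls -la "$staging"                     # last look before the point of no return
        rm -rf -- "./$staging"                # exact literal path, no completion involved
    }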
A very long time ago in my first job, fresh out of university, I was hired as a software engineer and I was writing code in Fortran on a PDP11/34 in an electronics lab for a bunch of engineers.
I deleted all of my new boss's program files - just his files - in the first week of starting work.
There were no backups; despite the PDP having removable hard drives, no one had ever thought to make 'a' backup, let alone regular backups.
There were, however, printouts. Fortunately these were the early days of computers and disks were only a few megabytes in size, so while the deleted files were important, they weren't so massive that they couldn't be typed back in.
So I spent the next week, laboriously typing them back in, making sure they compiled & produced expected results.
Then I sorted out a backup regime.
After I'd done all this, my boss informed me that not all of his programs had compiled before & he was pleasantly surprised they did now.
I love those OSes where the login prompt can be duly customized with RED BOLD fonts... so you apply them to ROOT.
While in Windows, 99.8% of users run the thing as admins, out of the box, where even the simplest DELETE keystroke can bork your system... so MS invented (or hell, they stole, most probably) the Trash Bin.
The trash bin saved me more than once, must admit.
they stole, most probably) the Trash Bin.
Considering I recall seeing a trashcan on a Mac back in the Win3.1 era, I'm guessing "stole".
EDIT: Yep. Wikipedia says Apple Lisa had a "Wastebasket" 1982, MS-DOS 6 (1991 at earliest) had "delete sentry", then Win95 actually had a Recycle Bin. Definitely "stolen".
Simple job: one of the cluster members was unwell, had hung on a reboot, and had been like that for months. Me being dutiful, I decided to fix it.
I visited the DC, disconnected its sync and data cables, rebooted it into an older image, binned the broken image, copied across the same image as on the live one and booted it. It came up, I checked to make sure it would not become live when I rejoined the cluster, connected the sync and data cables - and observed on the console the formerly broken ASA copying its out-of-date config to the running ASA, while itself remaining in backup mode.
Massive "oh $%^&". Luckily I had taken a copy of the original live ASA config before I started work, and its serial cable was still plugged in; the console session was still live and I pasted the former live config back in. Write mem - and the config from the once-broken ASA overwrote it again. I then pulled the other ASA's sync and data cables, pasted the config into the live box again, wr mem, consoled into the other ASA, wr erase and reload, booted it and brought it back into the cluster.
Never seen that before or since; it was the kind of unbelievable event you'd expect from a newb. Luckily I resolved the issue quickly enough that no one noticed. It turned out the ASA had been broken before the cluster was moved from another DC and had been installed faulty. The new DC had the ports rearranged, so its connectivity was definitely broken, as the old config had the wrong port assignments.
It was only last week that I hit f5 on:
delete from table
where field=x
...but had accidentally only highlighted the delete line in SQL Server Management Studio. 200,000 rows were gone in the blink of an eye.
Luckily this was only on a dev database and our dev environment is such a mess that no-one really uses it. Anyone who noticed anything whilst I frantically scripted the data back into existence probably accepted the weirdness as normal.