Could have been worse...
Bruce could have been working at Hawaii Emergency Management Agency last January.
Welcome to Monday morning, dear readers. We’ll try to make it bearable for you by offering you a new instalment of “Who, me?”, The Register’s column in which readers share stories of having screwed things up. This week meet “Bruce” who told us that “Many years ago I was a junior sysadmin for a large battery manufacturer and I …
I've known more than a few people to do this.
I once told a soldier the portable version of a server was ready to be shut down and packed up for deployment; he dutifully walked into the server room, up to a (very) non-portable 42U rack, and shut down the servers in that. Cue calls to my phone from across Blighty asking why systems were down. Thankfully, they didn't take too long to bring back up, but I did have to explain what had happened to some much higher levels.
That was before the days of molly-guard, but I now make sure it's on everything to help avoid accidents (not sure it'd have helped in that case though).
One of my more wonderful moments was working with a group testing a prototype communications analysis system that I had designed and was responsible for testing in the field. It was a very wet and cold day in the middle of the British 'wilderness' (such as it is).
Performance was good, all of the teams identified safe places to site their equipment and there were no obvious complaints about (a very prototyped) UI designed to be usable with gloves and in a somewhat hurried manner.
But I was there for the last week of trials and thought, after watching people using it, I would take myself off to talk to some of the remote groups and ask them about their experience (hey, I was all that could be described as a usability assessor as well as technical designer). The last team I saw were perched on the side of a hill in a bare gap between protecting woodlands. It was wet and very cold. I was pleasantly surprised by their enthusiasm. Universally and enthusiastically approved.
I tried to drill down on why and was somewhat chagrined by their response: "It runs so hot that we take turns warming our feet on it!"
Needless to say we did manage to get the power consumption down, and the prototypes were deployed in Bosnia 4 months later......
> Soldiers take things very literally. Never EVER label anything as "BOOT"
Yeah, to be fair to him he was just having a bad day. He knew more than enough about the systems to have not made that mistake, just wasn't really with it that morning.
Not that that made it any easier to explain up the chain, of course.
Bridges are portable. We took them to bits, moved them somewhere else and put them back together again.
Just because we needed a load of lorries etc does not make them any less portable. Nowadays, they would probably sling larger chunks beneath Chinooks and spend less time stuck in muddy places.
I am not sure I can define what the RE officially did not consider portable but the limits will have only increased in the intervening decades!
Holes are also portable. Especially when a staff sergeant tells you that you dug his hole 3 inches too far to the left, pulls out a measuring stick and demonstrates that you dug it in the wrong place and 2 inches too deep.
This in driving sleet on a mountain in the Brecon Beacons in February.
In that case the Soviets had portable factories. In the face of the German invasion in WW2 they completely dismantled many factories in European Russia and reassembled them in the Ural Mountains and beyond. I've always been impressed by that.
Railways were apparently the key to moving the factories.
"I once told a soldier the portable version of a server was ready to be shut-down and packed up for deployment, he dutifully walked into the server room up to a (very) non-portable 42u rack and shutdown the servers in that"
In fairness to him, soldiers tend to get used to carrying 30kg packs. He might have a different idea of portable to you and me.
I have created tiled bitmaps with the server's name on them (e.g. NODE1, PRIMARY DOMAIN CONTROLLER, etc.), so if you log in to a server via RDP you can instantly see which server it is that you're working on.
And, yes, this was preceded by me rebooting the wrong server. Now I can instantly see which server I'm working on, and this avoids mistakes.
Face it, a slew of open RDP sessions on your desktop will invariably cause you to issue the wrong command in the wrong window. Fun.
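For the curious, one way to knock up such a bitmap - a rough sketch using ImageMagick (tool choice and colours are illustrative, not necessarily what the poster used):

# generate a 400x200 tile carrying the box's name, ready to be set as a
# tiled desktop background on that server
HOST=$(hostname | tr '[:lower:]' '[:upper:]')
convert -size 400x200 xc:darkblue -fill white -gravity center \
        -pointsize 36 -annotate +0+0 "$HOST" "${HOST}-wallpaper.bmp"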
> background was red with pictures of bombs on it.
Suse Linux had this. IIRC it was brought in after people did things as root, not recognizing they were, often enough with serious consequences.
After some major terror attack (don't recall which) it was removed.
"I have created tiled bitmaps with the server's name on it"
We tried that at a customer on their RDP servers, so users can quickly look at the desktop to tell techs which server they are on.
Turns out roaming profiles will cache the background image, even if it's set by GPO at the computer level.
DesktopInfo is a wonderful tool.
Just come up with a template INI file and stick it somewhere all RDP users can read it, create a shortcut in ProgramData...\Startup to launch desktopinfo.exe for all users, and bake that into your gold image. Easy to package and distribute as well.
Then you get the name of your system as big as you want on screen - colour code for prod/non-prod if you're fancy, and some cute at-a-glance statuses if you want those as well.
In late 1977 I managed to take down all the PDP10 kit at Stanford and Berkeley with a software upgrade. Effectively split the West coast ARPANet in half for a couple hours. Not fun having bigwigs from Moffett and NASA Ames screaming because they couldn't talk to JPL and Lockheed without going through MIT ...
Taking down TOPS10 was so easy a luser could do it by assigning too many disk name aliases.
Mostly done for shits and giggles on last day of term with the added entertainment of super-lusers going to the computer centre to wrongly claim "I've just crashed the system".
These days that would probably be terrorism or some serious offence.
...your leader is worth following. Screwups will happen. But will grace and a second chance happen as well? If you find these in a leader, make sure you follow that person.
Bet the admin here never, ever made the same mistake again; performance across the board probably amped up as the lesson drove home the seriousness of the job.
I encountered a great leader once, in my first year of college, working in a copy and print shop. The owner - a recent immigrant from Lebanon working three jobs at once to get enough cash to bring his family over - always seemed to be a hard man. But after one all-nighter running a $10,000 job I realized all too late that I'd screwed up the whole thing, and lost a major client. Margins are razor thin so we ate something like $9,600. When Mr. Hammad came in, I just had to press my "man up" button, tell him what I'd done, and wait to be fired. Instead he stared at me for a very long time, and took me in the back for a cup of tea. His one question - one that still stings across the years - was "So... tell me exactly why you are so careless with our money? Our paper and supplies and our customers? Did you respect our customer? Is that what you want to be?" Then: "I should fire you, but instead I want you to stay here and show me who you really are." I wasn't fired, and ended up running the business.
Guys and gals like that are tough to find, but the world really needs them. So try to be one.
Everybody on the team, from server admins right through to data input, should know that if they screw up there will be no negative consequence for their career if they own up and alert the rest of the team the moment it happens.
Because fear can cause a cover-up and an attempt to hide the problem, and then the problem can compound out of control the further in time you get from the error.
I once killed someone's SQL server (and hence the app that was accessing it) by running a not-inconsiderable query. Ordinarily, this would have been fine, if slow.
The kicker was that the server had a dying raid drive. This would have been picked up in the normal course of events by the engineering bods and replaced, however that hadn't happened yet.
The extra load combined with the slowdown from the dying drive ground the system to a halt.
Cue some frantic work to get things back up, and a replacement drive sent back out ASAP.
This type of thing is so easily done the only real safeguard is a fully redundant system with fault tolerance. It still baffles me today that major transport operators, banks and so on experience outages when a correctly architected and implemented solution should keep outages at bay, even taking disasters into account.
Messed up real badly last week and just now.
Inadvertently left SSH open on firewall. Ne'er-do-wells entered and had a Most Frolicsome Time. I'm just glad it was not cryptolocker etc...
Anyway.
We picked it up when our support officers could not log in to the site. Bandwidth usage was swamped.
Added some rules to kill off most IP addresses. That left us with some leeway.
But now another server at another location (which was a trusted host, and one you can SSH to) is showing the same thing - over-utilization of bandwidth.
Fun way to start your week.
And I have learnt the SSH lesson the hard way now. Never, ever make it publicly accessible.
We still have to decide on the way forward.
Stuff happens, you dust yourself off, learn and move on. Something more than just leaving the port open must also be your problem; have a close look at sshd_config.
If you disable logins with passwords, prohibit root login, and use only pre-shared keys, the security of SSH is pretty tough to beat. Yeah, there have been some zero days in OpenSSH, but I can say that about a lot of software.
For bonus points I use a nonstandard SSH port for my development and production environments. It doesn't increase the security of the protocol per se, but the Chinese robocall activity on port 22 no longer obscures my logs. Now anything that hits my SSH has done so after someone did a proper port scan, which obviously makes me sit up a little straighter and think about it.
Live not in fear, SSH can work for you... The vast majority of the time... But configuration is key
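To make that concrete, a minimal sketch of the relevant /etc/ssh/sshd_config directives (the port number is just an example - pick your own):

# /etc/ssh/sshd_config - minimal hardening sketch
Port 2222                    # nonstandard port: quieter logs, not extra security
PermitRootLogin no           # no direct root logins
PasswordAuthentication no    # keys only, no password guessing
PubkeyAuthentication yes
# then reload sshd (e.g. 'systemctl reload sshd'; service name varies by distro)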
If it's a Linuxy box, check out Fail2Ban. It dynamically creates iptables rules on receiving bad logon requests (or whatever other criteria you select from the sshd log) at whatever frequency/time interval you choose.
I used it for Postfix, for dropping SMTP connections that were attempted more than three times in a row from hosts that were blacklisted in our RBL - those got banned for 6 hours. Also, hosts that attempted more than 20 messages in 5 mins to "unknown recipients" - they were dropped for 2 hours, I think - a cheap person's DHA throttle.
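For the SSH case specifically, a minimal sketch assuming a reasonably recent Fail2Ban (the retry counts and ban times here are illustrative, not the Postfix numbers above):

# /etc/fail2ban/jail.local - ban hosts that keep failing SSH logins
[sshd]
enabled  = true
port     = ssh
maxretry = 5      # failed attempts allowed...
findtime = 10m    # ...within this window...
bantime  = 6h     # ...before the source IP gets an iptables drop rule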
Someone doing some "futzing" with a test script, sending random-bit packets towards a new IPv6 stack, set off a Cisco bug that hard-locked the routers on the corporate network. We noticed because suddenly the gym started to fill up with testers evacuating from the "zone of responsibility" before the arrival of the local BOFH.
Many BOFHs had to travel back on a late Friday afternoon to power cycle the routers in several locations to get them back up.
However, the vendors of remote-controlled PDUs were happy, and Cisco were very happy about the bug. Maybe the BOFHs liked the overtime even while being unhappy.
...one of the very best techniques I've found for stress testing electronic hardware is power cycling. Do it hot, do it cold, do it at low and high voltage, do it while undergoing thermal shock. Do it at all corners of the design envelope. Use a big, ugly mechanical contactor with contacts that bounce like a hyperactive kid on speed. Add ridiculous amounts of line inductance. Put in parallel with a big, ugly motor load. Switch that on and off violently as well. Stuff will die, horribly. HW engineers will whine, switchmode supplies will scream. Mod design as needed and rock on. Very quickly you've got a much more reliable system.
So our AS/400 hero is really just helping beta test IBM HW... Just without the thanks.
CPU cooker
I was having a look inside a PII 350 PC. A clone machine with no badge on it, and I've no idea where they got it from. The PC was on, top off, and I may have slightly nudged the CPU with my wrist.
Fast forward a couple of minutes: I'd just hung up the phone and smelt burning. I turned around and noticed said PII 350 was smoking. Cue ripping the cables out of the back and getting the PC out the fire escape without setting the fire alarm off (good thing the fire escape was just across from the IT office, hmmm, was that planned?).
After it had finished smoking I noticed that the CPU didn't have the clips to properly secure it to the retention bracket. Oops.
Visiting a remote site for a printer problem, they had an all-in-one system that rather strangely had Windows Server on it. I couldn't work out why the printer wasn't working, so decided to reboot the PC - give me a chance to ponder my next move while stroking my beard.
Turns out it was an RDP session to the main server! And when it rebooted the database services didn't start up automatically.... So I messed up every GUM clinic in the county....
When I returned to the main office, I pointed out that I didn't so much reboot the server but did some "unscheduled data resilience testing - AND YOU FAILED!" Due to that, logged on generic user accounts can no longer reboot servers and the services now start up automatically.
....the only thing I've done that affected everyone was with a certain MFD solution. It has "follow me printing". You print to one queue and can sign on to any MFD and print. All good until one day I thought I'd make it more secure. Let's turn on the option that purges print jobs when you log off.
A day later I started to note calls coming in that prints were half printing and then stopping. I took the calls and then noticed more coming in.
Then I realised what was happening. People were signing on to print, choosing print, then instead of waiting for the job to finish they were signing out, and sometimes their print was so long the MFD would auto sign out. Oh dear. That meant it would then purge their print job and they'd only get half of it.
Luckily I'd noticed all the calls come in, grabbed them, fixed the issue (turned off the option) and all went quiet again. All before any management had noticed.
For the end users, I just made up an excuse why it was happening. But that excuse never, ever blamed them. I hate engineers that do that.
Once knew a fella who usually worked in a small data center where the staff door exit release was a big green button.
The data center that he rarely went to on the floor below had a similar looking button, but was for emergency power down.
You can guess what happened...
On *nix machines, the molly-guard package installs a set of wrappers around shutdown, reboot, poweroff, etc. If it detects that you are inside an SSH session, it will ask you for the name of the box you intend to shut down. It refuses to shut down if you aren't on that box.
It doesn't normally intervene for console and desktops, so be careful with KVMs
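Roughly how that plays out on a Debian-ish box (behaviour as described above, commands sketched from memory):

# install the guard rails (Debian/Ubuntu package name: molly-guard)
sudo apt install molly-guard
# from inside an SSH session this no longer fires straight away:
sudo reboot
# molly-guard intercepts it and asks you to type this host's name back;
# get it wrong and the reboot is politely refused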
Years ago I was taking a new sysadmin on a tour of our machine rooms. We visited the brand spanking new one, where they asked what the big red button, the size of a melon, marked FPO was. I jokingly said, "I don't know... hit it." And... they did. The room went pitch dark... it was the fire emergency Full Power Off. "What do we do?" they asked. I said "Hit it again"... lights came on... "Now run". We did. Luckily the room was not fully online and none of the servers were production yet....
I was consulting on an SAP project a while back and there was a configuration master server (one of many development servers) which had root SSH trust to all the other SAP machines (yeah, not recommended from a security standpoint these days, but like I said it was a while back and the client's lead sysadmin set it up).
One day one of their guys, not paying attention to exactly where his shell is, executes a reboot command on the production server. Thankfully it wasn't in production yet, but there was a lot of qualification work going on, so it stopped a full team of D&T guys in their tracks who were probably billing $5000/hr collectively, plus at least 50 of the client's employees.
I created a root alias (easily distributed to all two dozen servers thanks to the ssh trust) that aliased 'reboot' as 'echo "must use reboot`hostname`"'. Then an alias for reboot`hostname`. So if you want to reboot the server fred you had to type rebootfred. No more worries about someone rebooting the wrong server.
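Something along these lines, as a rough sketch (rc file, paths and exact quoting will vary - assumes a bash root shell on each box):

# in root's .bashrc on every server; $(hostname) expands locally on each one
alias reboot='echo "must use reboot$(hostname)"'
alias "reboot$(hostname)"='/sbin/reboot'
# so on the server fred, plain 'reboot' just nags you and only 'rebootfred' works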
A fellow admin I worked with 10 years ago was told by a senior manager he needed to immediately know the last three times a critical production E25k domain was rebooted. Admin logged in to the domain mid-day and issued,
last grep reboot
Which we all know is not the same as,
last | grep reboot
Or as safe as,
last | grep boot
We wound up having to open a case with Sun to determine why the domain ‘crashed’ mid-day. They never found the root cause....
Data centre in a large Australian stockbroker, one of 3 distributed around the country, and on the wall next to the door exit button was a big emergency shutdown button. After a contractor had finished in the room he accidentally hit the wrong button to get out.
Shortly thereafter, a cover cage was fitted :)
AS/400s have the ability to pass through to another system (Passthrough, I think it was called), so you are logged onto one console but can bring up a console for another system. I, thinking I was in the training/test server, did the same thing for a VERY large manufacturer here in Canada and shut down the plant.
I was the Infrastructure & Operations Manager at the time and there was no IT VP, so I was temporarily reporting to the president, who took it all in stride, ran interference for me and averted some angry plant managers. I learned a lot about management that day and about what real leaders do in a crisis.
To this day I use it as an example of "I need the truth to be able to do my job, almost any mistake is excusable once as long as I know the truth".
One of my first roles was as an Oracle developer. I used to have administrative rights to the databases, which were located on these really massive VAX servers. They all had an initial after their names to denote which server it was. You only saw that on the command prompt.
I also handled database operations, under the sometimes watchful eye of the main IT manager (and main DBA). I spent most of my time in SQL*Plus so never really saw the command prompt - and would frequently connect to other database instances from within anyway.
I'd been working on a complicated set of financial reports that were written as SQL stored procedures. I was running them and noticed that the query wasn't very efficient (lots of joins etc), so decided to optimize the database on development with the inclusion of an index.
I was a bit tired after a heavy night drinking the night before... and after a few checks on the script I set it going, expecting it to finish immediately (we didn't have much test data), but the query went on and on... The CPU on my Mini maxed out and I lost my terminal as it became unresponsive. The phones on the IT floor started to ring - and then I realized I'd triggered it on production. No-one could work and I had killed the whole site accessing the production platform. The DBA rushed into my room...
Worst was that it took half a day to roll back, and an all-company memo blamed it on a pulled cable in the data center room (cough).
At a place I used to work, I recall one of the helldesk guys telling a user on-site that they would need to hard power off a server that had become unresponsive. "Press and hold the power button on the bottom server in the rack" was the instruction, and shortly after everything stopped.
Said helldesk guy forgot to take into account that to most users, a UPS looks like a server, and the bottom device in their rack was the UPS. Oops!
But seriously, I reckon there are 10 types of IT person: Those that have accidentally shut down or powered off something, and those that are lying when they claim that they haven't !
I'm yet to kill a server and take out anything like that (I'm still young, there is time), but it's not uncommon for me to issue a shutdown command on my own PC, and 10s later remember something that I needed to do...
I was at a defense outsourcer in Texas 20 years ago as a sales engineer selling system management software, and was told an amusing story about job control GUIs. One day they had put a fancy new GUI in front of their LPARs, to make their lives easier. The senior guy spun up an LPAR (or whatever you called it), ran some jobs, then shut it down.
Well, it kept asking him over and over if he wanted to shut down the LPAR, and he got annoyed and just started clicking "yes." It turns out he shut down like 18 LPARs by mistake. He only stopped when someone ran into the room panicking, asking if a catastrophic event had happened.