Why didn't you eject the CD player using Windows Explorer before touching the server?
Sysadmin left finger on power button for an hour to avert SAP outage
Welcome to the seventh instalment of Who, me? The Register's new column in which readers share stories of the times they broke stuff without any help at all from users. This week, meet "Jeremy" who back in 1999 scored his first "real" IT job "as part of a team sent out to run the IT at a big publisher." Said team was working …
COMMENTS
-
-
-
Monday 5th March 2018 08:20 GMT Lee D
Nope, but they do come with ID lights.
It's a really dumb thing to press the button on the wrong server. And... if we're talking about an era where holding in the power button doesn't kill the machine hard in 5 seconds, and where NT is running, and where it doesn't auto-power-off on the Turn Off Your Computer screen, then we're back in the age of floppy disks and maybe even pre-CD in your average server.
But whatever era, there will have been a better way to indicate what server you mean rather than just guessing.
-
Monday 5th March 2018 10:35 GMT Anonymous Coward
a) PL1000 / 1500 did not come with CD ROM per default - they were optional and expensive!
b) the Y2K Updates were done by Floppy - ROMPaq Updates
c) Hostnames were put on brownish labels and consisted of 8 positions: 2 letters for the city and then 6 numbers. No clue what kind of server it actually was - the lists were all you had.
After several hours of too many dB, temperatures between 15 and 40 °C (Depending on where you were in the DC) - one tends to get a bit "unfocussed" - as was the case with my pal J here.
(Yes, I post this anonymously - but I am said "Jeremy" ;)
-
Monday 5th March 2018 12:22 GMT Emmeran
Jeremy spoke in class today
This same sort of thing happened to a friend and co-worker and we did indeed make him stand there holding the button in until we could get the users out and the apps shut down.
I still recall that forlorn look on his face as he stood there alone in the data center, it brings a smile to my face.
-
Tuesday 6th March 2018 13:58 GMT Marco van de Voort
Moreover the default cdrom was not exactly standard. It was connected on the onboard SCSI (the IDE was connected to the floppy ?!?!) and the system firmware could only boot from devices that had a special (512byte sector emulation?) jumper on and used floppy emulation. In the mid 2000s the only distro that booted was Slackware 8.1
-
-
Monday 5th March 2018 12:41 GMT Anonymous Coward
"[...] there will have been a better way to indicate what server [...]"
Head came round the door "All yours". So I headed off to the machine room to do my testing on the cold stand-by comms processor. The console was mounted on top of the unit. Hit the keys for debugger mode - machine stops. Sudden howl of anguish behind me.
There were two comms processors. That day they had decided to use the stand-by one for the official acceptance time trials. Thankfully the presiding government official allowed that as a genuine mistake that did not affect the acceptance criteria - and the repeat run was ok.
After that there was a large notice on whichever one was the live machine.
-
-
-
-
-
Monday 5th March 2018 07:43 GMT Anonymous Coward
I did that once, not on a production server but home computer back in the days when a power off could kill it forever or severely mess up your next reboot. It teaches you a lesson about computer placement that you never forget. These days I have two pieces of cardboard wrapped in black tape over the top of the buttons because they are on top of the case.
-
-
Monday 5th March 2018 12:48 GMT Boothy
Re: New PC case.
I like my current Antec Tower case (had it years now, triggers broom ya knows).
It has a full height door on the front, hiding things like 5.25 and 3.5 bays (all unused these days), but it also hides the Power and Reset buttons.
I don't know if by design, or accident, but the edge of the door also has large (finger sized) air vent holes from top to halfway down, the bottom one of which lines up quite nicely with the buttons.
So no way to hit them by accident, but you can still use them without having to open the door.
-
Tuesday 6th March 2018 01:17 GMT d3vy
Re: New PC case.
"I have a wonderful mini-itx PC case. Only problem is the power button is on the top, just where I may rest something for a moment. Like a game controller or whatever."
I used to work in an office where there were two banks of desks fed from two sockets (with extension leads - but that's not the wtf here..) the sockets were at about the same height as the head rest on your average office swivel chair and positioned right behind someone's desk.
The number of times that the power got knocked off started to get daft so the management's solution... Not to move the socket... not to change the sockets so there was no switch... Shove a few old PSUs under the desks to feed the PCs if the switch gets hit.
-
Tuesday 6th March 2018 16:22 GMT JimboSmith
Re: New PC case.
At home I have spike protected power strips on all my equipment. I did once have something that was fried when the power came back on after a power cut and am now slightly paranoid. Most of them have either no power switch or a recessed one to prevent you switching things off accidentally. The one place it does have a switch is the living room entertainment area (TV, DVD, Blu ray, Satellite receivers, CD, Amp etc.) My housemate was (out but) recording something she wanted to see and I didn't, so I thought I would tidy up the cables around the back. I was busy doing this when I heard the TV power off and standby lights which were reflected by the coffee table go out. I then spotted the switch that my knee had just hit which turned the blasted strip off.
The recording was now stopped and it would take a couple of minutes to get everything back up and running. So I switched off and then back on the power to her room at the circuit breaker and claimed it was a power cut. She only lost approximately 4 minutes of the tv prog but I vowed then and there to replace that strip. I now have a 19 inch rack unit which has proper power distribution strips (with spike protection) screwed to the back. These have sunken switches to prevent accidental presses.
-
-
-
Monday 5th March 2018 15:39 GMT ma1010
Cats can be a problem
Years ago in college, we worked in groups, and one of my group mates had a cat. One Sunday he was at home compiling all his hard work and watching the cat play around the computer, "Aww, how cute..."
Then the cat nosed the big, red RESET button and thrashed his work. "Get out of here, you damned thing!" Not so cute, then, apparently.
-
Monday 5th March 2018 17:15 GMT Oh Homer
Re: I did that once
My only fatal "power off" incident involved a PC with a mechanically failing drive (bearings failure, I believe). It was borked anyway, but in my desperation to recover at least some data I made the mistake of turning it off to try some other method. It never spun up again.
Then there was the classic where I formatted the wrong drive, back before I knew that drives could be "un-formatted". Lost everything that day. I also gained a greater appreciation for backups.
There was one happy tale, though.
I had an Amiga 4K with a CyberPPC SCSI controller, which I'd been using without issue for years, until one day I decided to meddle with settings I didn't understand at all, in CyberPrefs (the SCSI controller's firmware settings). I came in first place for the Darwin Awards that day, and ended up with an unrecognised drive.
Just to absolutely prove how stupid I was, for some reason I didn't make the connection between my meddling and the fact that my drive had mysteriously disappeared. As my brain sunk deeper into hibernation mode, I gave up completely and bought a PC - my first ever in fact, from the sadly long defunct First Computer Centre in Leeds. So in a way, I have that SCSI controller (or my stupidity) to thank for many happy years playing Doom and Quake, then eventually getting fed up with Windows and switching to Linux.
Years later I fired up that Amiga 4K, went straight into CyberPrefs, changed back the incompatible setting, and rediscovered my "missing" drive, along with my long-lost youth.
-
-
-
-
Monday 5th March 2018 18:47 GMT Anonymous Coward
1996? I doubt that. Probably 1997 or 1998 for you.
ACPI only got released in December 1996, and the first PCs with ACPI sold in 1997.
Widespread adoption was only in 1998/99.
Windows 95 had no ACPI support, Win 98 came with disabled ACPI. Only Linux 2.6 and Windows 2000 and onwards supported ACPI. And those OS even disabled ACPI on pre-2000 hardware, as ACPI v1 was quite buggy.
https://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface
-
Tuesday 6th March 2018 02:52 GMT JBowler
Huh?
>1996? I doubt that. Probably 1997 or 1998 for you.
You know Steve, the Cynic, then, Anonymous Coward?
>ACPI only got released in December 1996
Duh..... duh..... Like, someone developed it dude.
Quoting from Wikipedia just proves you work in a troll farm for putin.
I don't know Steve, the Cynic, but I do know what I was doing in December 1996 and it certainly wasn't released until some time in 1997.
-
-
Tuesday 21st August 2018 01:04 GMT StargateSg7
My old Compaq 386-series did that and that was late 1987 or so. You could shut the computer itself down completely via software by pushing some values into the x86 registers and calling an interrupt which was NOT part of the MS-DOS or IBM PC-AT BIOS standard INT calls. AND if your terminal display was SCART compatible like ours were (they were basically industrial-grade 20 inch Sony Trinitron TV's used as computer displays with 800x600 pixels of resolution), we could even shut down the monitor from software in 1987! ..SOOOO.....this isn't new technology.
-
-
-
-
Tuesday 6th March 2018 08:10 GMT Montreal Sean
It'll work fine
"It'll work fine. You just need the users logged out, and everything shutdown within 4 seconds. A challenge, but perfectly achievable for a true boffin!"
Bah! Users should be saving their work every 10 minutes or less.
If they didn't, too bad for them.
Ok, I may be a bit of a bastard...
-
Tuesday 6th March 2018 14:29 GMT Anonymous Coward
Re: It'll work fine
I worked in a college with a CTO who thought like that. On multiple occasions I saw him cheerfully reboot terminal servers, kicking all users out without warning, to 'fix' stuck print queues. On examination days.
I'd then get lumbered with the job of helping anxious teachers fill out the 'extenuating circumstances' paperwork for the roughly 10% of kids who'd basically been given an instant exam resit. Bafflingly, the college never lost its exam centre status, and the CTO was never disciplined in any way, as all the other bosses seemed to view it as an act of God.
-
-
-
-
Monday 5th March 2018 09:12 GMT alain williams
Typed 'Reboot' where ... ?
Telnetted into various Unix machines, wanted to restart the one in the server room. Whoops - I forgot which machine I was logged into and typed 'reboot' to a machine on the other side of the planet. It did not come up, had to wait until teatime for the guys there to come in and push a button :-(
-
Monday 5th March 2018 09:18 GMT wyatt
Re: Types 'Halt' where ... ?
I can hold my hands up to that one as well. SSH to a server then a workstation, 1 letter (c/s) difference between them and I wasn't on the client.. Fortunately it was a reboot rather than a shutdown and it also happened during another major outage so the impact was minimal.
-
Monday 5th March 2018 10:27 GMT Chris King
Re: Types 'Halt' where ... ?
I was on the receiving end of that once.
An academic had moved to another uni, and my opposite number at the new uni was helping him transfer his files from our OpenVMS machine to theirs.
In another telnet window, the IT bod was logged into his test OpenVMS machine, preparing to test a patch for a nasty little crash bug that anyone with telnet/SSH access could trigger - no extra privileges required.
Yes, he got the two windows mixed up, and our box dropped dead.
Boom, crash dump and P00>>> prompt at the system console.
Fortunately, it happened in the middle of a change window, ironically to install and test the very same patch. Still, it's not really the sort of thing you want to see when logged into SYSTEM at the console and installing patches.
The other guy phoned me a couple of minutes later and 'fessed up to his mistake.
-
-
Monday 5th March 2018 10:05 GMT Aitor 1
Re: Typed 'Reboot' where ... ?
A work colleague (admin) lost his job that way many moons ago.
He thought he was putting my code (well, a version update of the project I lead) into the integration environment.. but put it into production, as he had both terminals open, and made the huge error of pulling from command line. I had told him before to put into the server, and execute from the server.. as a friendly suggestion.
He was lucky in the sense that there were no bugs in the code, so in a sense the systems kept working, unlucky in the sense that this was in the client/server era, so the decision was to push the updated client. 45 minutes down time for 50/100 ppl (dont remember well).
He lost his job for a single mistake in two years, I am still a bit angry about that.
-
Monday 5th March 2018 12:15 GMT Evil Auditor
Re: Typed 'Reboot' where ... ?
He lost his job for a single mistake in two years, I am still a bit angry about that
Angry that he made this single mistake or angry that he lost his job?
Depending on the type of business 45 minutes downtime may or may not be reason for dismissal. Apparently, more than 5 hours of downtime for approx. 90% of the staff of about 20k (a bank) was no reason for dismissal. Then again, it wasn't due to an operating error but a management decision to implement a half-baked release.
-
-
Monday 5th March 2018 16:40 GMT Cynic_999
Re: Typed 'Reboot' where ... ?
"
Indeed, seems a bit of a stupid decision to me, especially as the fired guy is definitely going to be the one person you are certain would never make *that* mistake again.
"
The same might be said of a driver who accidentally hits the accelerator instead of the brake and ploughs into a bus queue. But I guarantee he would lose his licence at the very least, and be lucky if he escaped jail. Usually it is the severity of the act that is punished, but sometimes the consequences of a simple mistake are so severe that they are taken into account as well.
-
Tuesday 6th March 2018 12:20 GMT Wayland
Re: Typed 'Reboot' where ... ?
Cynic_999 "accidentally hits the accelerator instead of the brake"
It's not the same at all because the driver is using all the controls constantly with no problem. To catastrophically make three errors all at once and persist with those errors until people are run over is nothing like being a bit late on the brake pedal.
There is a video where a man is chased into a layby onto the pavement by a bus which smashes through the front of a shop. The man managed to escape but the driver claimed he hit the wrong pedal.
From what I can remember about driving (have not driven since yesterday) you don't hit the pedals with your feet you gradually press them to cause the amount of acceleration or deceleration you need. You begin doing this in plenty of time and you can press the pedals harder if you need more effect.
In a rack of servers it's an easy mistake to be looking at the wrong server, hence the little button that lights up so you can figure out which one you want to work on.
-
Tuesday 6th March 2018 15:39 GMT Cynic_999
Re: Typed 'Reboot' where ... ?
"
From what I can remember about driving (have not driven since yesterday) you don't hit the pedals with your feet you gradually press them to cause the amount of acceleration or deceleration you need
"
It is easier than you might think.
Imagine that you are closing slowly with the car in front. So you press gently on the brake but you see that you are still closing with the car in front. So you press a bit harder - and see the gap is now closing *really* fast, so you panic and jam the brake pedal full to the floor. Only later do you realise that your foot had been on the accelerator rather than the brake.
Or while stopped you start reading a text on the phone in your lap when out of the corner of your eye you suddenly see that your car has started slowly rolling forward because you forgot to set the handbrake. Sudden adrenaline rush and panic, you stamp hard on the brake to stop the car before it rolls into something - except it isn't the brake.
-
Tuesday 6th March 2018 15:40 GMT Cynic_999
Re: Typed 'Reboot' where ... ?
"
There is a video where a man is chased into a layby onto the pavement by a bus which smashes through the front of a shop. The man managed to escape but the driver claimed he hit the wrong pedal.
"
You really think the driver did it deliberately? You have obviously never reacted in a panic.
-
-
-
Monday 5th March 2018 13:28 GMT I am the liquor
Re: Typed 'Reboot' where ... ?
@Evil Auditor
Depending on the type of business 45 minutes downtime may or may not be reason for dismissal.
If 45 minutes downtime is that much of a problem, then sacking the tech who caused it by a simple finger fumble is nothing more than scapegoating. More reasonable would be to sack the executive who failed to put in place systems ensuring a simple human error couldn't cause such a serious problem.
-
Monday 5th March 2018 18:38 GMT Mark 85
Re: Typed 'Reboot' where ... ?
More reasonable would be to sack the executive who failed to put in place systems ensuring a simple human error couldn't cause such a serious problem.
In a perfect world, yes that would be the right thing to do. In the real world, the execs protect each other and everyone else is cannon fodder and/or scapegoats.
-
-
-
-
-
Tuesday 6th March 2018 04:35 GMT keithzg
Re: Typed 'Reboot' where ... ?
molly-guard has definitely saved me more than once. I don't have it installed on *all* the servers, but I sure as hell do on the servers where it would matter . . .
(That being said, if things are fragile enough that a clean reboot is a big problem, things are probably too fragile.)
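For anyone who hasn't met it: when you ask for a reboot over SSH, molly-guard makes you type the hostname of the box before it will go ahead. A minimal sketch of that check, purely for illustration (the function name `confirm_reboot` and the short-name comparison are my own assumptions, not molly-guard's actual code):

```python
import socket


def confirm_reboot(typed_hostname, actual_hostname=None):
    """Return True only if the operator typed the name of the machine they are on.

    Hypothetical helper: actual_hostname defaults to this machine's name,
    and only the short (pre-dot) part is compared, case-insensitively.
    """
    if actual_hostname is None:
        actual_hostname = socket.gethostname()
    short_name = actual_hostname.split(".")[0].lower()
    return typed_hostname.strip().lower() == short_name


if __name__ == "__main__":
    answer = input("Type the hostname of the machine to reboot: ")
    if confirm_reboot(answer):
        print("Hostname matches, reboot would proceed here.")
    else:
        print("Wrong hostname, refusing to reboot.")
```

The point is that "yes" is too easy to type on autopilot; having to name the machine forces you to look at which window you are actually in.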
-
-
-
Monday 5th March 2018 13:02 GMT Boothy
Re: Typed 'Reboot' where ... ?
I used to use PuTTY a lot on Windows, into *NIX boxes, back then we had direct access to boxes. So we just set up each environment with its own custom colour and window title settings. Green, you're on a Dev box, Red - prod, etc. Nice and easy.
These days we have to go via jump boxes, and are usually on Linux laptops. So it's all basically one shell, same colour for text etc. But someone did tweak all the Red Hat boxes, so if in a live environment, the user and server name all have a red background colour at the prompt. (The name@server: bit).
Still doesn't help if you're on the wrong prod box, but at least you are less likely to run something not prod friendly.
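Roughly what the prompt tweak amounts to, as a sketch only: wrap the name@server bit in an ANSI background colour chosen by environment. The environment names and the `coloured_prompt` function are made up for illustration; our real boxes do it in the shell profile.

```python
# ANSI escape sequences: 41 = red background, 42 = green background, 0 = reset.
RED_BG = "\033[41m"
GREEN_BG = "\033[42m"
RESET = "\033[0m"


def coloured_prompt(user, host, environment):
    """Build a 'user@host: ' prompt, coloured by (hypothetical) environment name."""
    colours = {"prod": RED_BG, "dev": GREEN_BG}
    colour = colours.get(environment, "")
    label = f"{user}@{host}"
    if colour:
        # Only the name@server part gets the background, not the typed text.
        label = f"{colour}{label}{RESET}"
    return f"{label}: "


if __name__ == "__main__":
    print(coloured_prompt("boothy", "db01", "prod") + "ls -l")
    print(coloured_prompt("boothy", "build03", "dev") + "ls -l")
```

If you put sequences like these straight into a bash PS1, wrap them in \[ and \] so the shell doesn't count them towards the prompt length.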
-
-
Monday 5th March 2018 13:34 GMT Anonymous Coward
Re: Typed 'Reboot' where ... ?
Yeah, I feel your pain on that one.
I once managed to balls-up a firewall rules tweak on a Linux machine in our new branch office in Sydney, accidentally removing a critical rule and therefore cutting off my own comms from said machine. Muppet.
Had to wait for local staff to arrive in the office and talk them through restoring remote access via the console.
-
-
Monday 5th March 2018 17:47 GMT Stevie
Re: color background
Agree. I have profiles for my terminal software for about a dozen or so foreground/background combinations. If I have something that cannot be the subject of a mistake, it goes in the white on red window.
The youngsters in my office laugh at this and would rather use other (unapproved) software that either doesn't offer a way to store multiple profiles easily or that they can't be bothered to learn how to use properly.
One of the bright young things, working in a forest of white on black consoles, restored pages from a test database over our production database and caused a complicated partial outage that lasted a week while we sorted it all out.
Another Young Genius obliterated a QA cluster under the impression he was working on a dev system.
Yep. The problem is The Old Guy doesn't "get it".
-
-
Friday 9th March 2018 11:22 GMT regadpellagru
Re: Typed 'Reboot' where ... ?
"Telnetted into various Unix machines, wanted to restart the one in the server room. Whoops - I forgot which machine I was logged into and typed 'reboot' to a machine on the other side of the planet. It did not come up, had to wait until teatime for the guys there to come in and push a button :-("
Who hasn't done this one, I wonder. Happened to me as well: wanted to reboot my SUN workstation, so typed "reboot", then I had "end connection" on that very window ...
Got me quite pale for a moment: I didn't know which system I so rebooted and I was logged to quite a lot !
Then colleagues told me every workstation had frozen: I was logged to the NIS server, which, fortunately came back 30 s after ...
-
-
Monday 5th March 2018 09:29 GMT chivo243
Little fingers
Once I was working (playing a game actually) in the office, and my boy (I think he was 4 at the time) comes rolling by, and says what does this light do papa? as he's pushing the power button!! Needless to say, my gaming session was ended, and that PC never seemed right after that incident.
-
-
Monday 5th March 2018 17:02 GMT Anonymous Coward
Re: Little fingers
I think I can top that one - at security in the airport waiting to get on a plane - my little one accidentally shut off the entire scanning line when he turned off a power bar - had to wait 15 minutes while everything rebooted and then they had to rescan everything....when I left homeland security was looking at the power bar and talking about how to prevent such an incident again...
-
-
-
Monday 5th March 2018 09:52 GMT &rew
Fast fingers
I recall for old PCs that used an actual mains voltage power button, if you pressed the power button in, and then really quickly popped the switch out and in again, there was enough smoothing in the power supply to cover the momentary blip. True, I would not be willing to attempt that on a company server, though...
-
Monday 5th March 2018 13:39 GMT Oz
Re: Fast fingers
I have saved myself from a thorough dressing down doing just that back in the mid 90s. I held the power button down on a server to force a power off, realised it was the wrong server, thankfully before letting go again and, after several minutes of holding the button in and deliberating, was able to release and re-press the button before the power dropped out.
-
-
Monday 5th March 2018 10:27 GMT Tim99
I guess that beats my post
From last month: my idiocy was only going to trash my work...
-
Monday 5th March 2018 10:40 GMT Remy Redert
Ever since I had a cat induced computer outage when one jumped onto the case and sat on the power button, I've taken to the simple expedient of not connecting any of the buttons on the case, setting the machine to start when the power comes on. The big switch for the power bar is much less sensitive to cat induced failures.
On a related note, which idiot of a designer decided that buttons should be put on the top of the case, where they're hard to reach if the case is in any kind of enclosure and easy to set off accidentally if they're not?
-
Monday 5th March 2018 10:57 GMT DuchessofDukeStreet
Which Idiot of Designer?
The one who recognised that most office users would end up with a large box sitting beside their legs under their desk - buttons on top are the most accessible from a seated position (assuming you're talking about a vertical unit).
For horizontal ones, on top still makes sense as it prevents them being knocked accidentally for objects being pushed around the desk surface.
But also one who doesn't own/is owned by a cat, and doesn't recognise their tendency to jump onto any available (and inconvenient) surface, particularly one that's radiating heat.
-
Monday 5th March 2018 11:20 GMT graeme leggett
Re: Which Idiot of Designer?
I have exactly that sort of machine (a Dell OptiPlex "designed" for office use) sat beside me and occasionally I nudge the power button with my knee. Fortunately this is set to initiate a hibernation rather than shutdown.
I have experimented with putting some of those flippy lid button covers over the switch - held on with double sided tape due to location at the top corner of the front bezel. Short of dismantling the front and getting busy with glue and screws the fix is far from permanent.
-
Tuesday 6th March 2018 01:39 GMT d3vy
Re: Which Idiot of Designer?
"I have experimented with putting some of those flippy lid button covers over the switch - held on with double sided tape due to location at the top corner of the front bezel. Short of dismantling the front and getting busy with glue and screws the fix is far from permanent"
Pull the side off and disconnect the button completely.
Then either buy a replacement button that can be positioned at the back of the PC or set the machine to wake on keyboard so you no longer need a physical button on the case.
I have mine set to boot on power resume and everything on the desk is plugged into a 5 way surge protector so that when I flick the mains switch everything comes on at once.
-
Wednesday 7th March 2018 00:04 GMT Alan Brown
Re: Which Idiot of Designer?
"Fortunately this is set to initiate a hibernation rather than shutdown."
Assuming Windows, go into the power settings and change "when the power button is pressed" from whatever it's set to, to "ASK"
It's not that difficult really - and there are similar settings in most *nixes (even if you have a CLI-only system)
It won't help you if you have an old style single PSU server with a real power switch on the front, but the "switch" on ATX systems is merely an input device and you can change its functionality.
Just don't do what someone I know did and swap "power" (big button) with "reset" (needed a pencil to press). Reset means RESET and having a wayward cat hit it is more of a problem than having the power go off.
-
-
Monday 5th March 2018 12:24 GMT Prst. V.Jeltz
Re: Which Idiot of Designer?
The one who recognised that most office users would end up with a large box sitting beside their legs under their desk - buttons on top are the most accessible from a seated position (assuming you're talking about a vertical unit).
For horizontal ones, on top still makes sense as it prevents them being knocked accidentally for objects being pushed around the desk surface.
Nah , sorry but all of that is bullshit . Buttons go on the front of things - end of . A user with a tower box under their desk will of course instinctively look on the front of it because - that's where buttons go . Yes , it may be *physically* easier to put it on the top , but it's still bloody stupid. cos: can't put anything on top of it , what if there's a shelf above it . people don't look there, just as easy to accidentally push, etc , ad infinitum. (this is why top loading VCRs died out? )
Your middle paragraph makes little grammatical sense but I think the gist of what you were getting at is covered above.
Your 3rd paragraph is of course correct, cats will sit on warm things , they will also jump on the desk itself and get between you and your game of Farcry in an effort to get fed. The more fiendish ones will do this by standing on F5, which you have assigned to "Load last saved game" :(
-
Monday 5th March 2018 12:27 GMT Prst. V.Jeltz
Re: Which Idiot of Designer?
The power button on my home box , apart from being in prime position to get toed when resting foot on the shelf it's on , has become a bit sticky and will tend to stick in when used which causes a kind of hernia / stroke in the BIOS . It takes a skilled touch to use it now - I'm not looking forward to having to explain that to someone over the phone in some sort of emergency .
-
-
-
Monday 5th March 2018 13:40 GMT John Stirling
@automatic power on
...I've taken to the simple expedient of not connecting any of the buttons on the case, setting the machine to start when the power comes on. The big switch for the power bar is much less sensitive to cat induced failures....
I used to do that, until the local power company decided to have an outage, which came back on 1 minute and 45 seconds later, and then went off again at the 2 minute mark, before repeating. For 26 hours over the weekend.
Which taught me a couple of things;
1) think hard before enabling auto on after power outage;
2) always use UPS on anything you care about.
3) Fridges also benefit from UPS.
Surprisingly a large percentage of the dozen or so PCs survived that little incident, although a number did not - and the Fridge needed a new motherboard!
-
Wednesday 7th March 2018 00:15 GMT Alan Brown
Re: @automatic power on
"Which taught me a couple of things;"
Due to many such episodes, $orkplace has trips on all the server room power to ensure that if the power goes off, it STAYS off until manually reset. There are similar setups on all the AC systems. You have to manually power up.
In the old days I would have put any critical (must be up) systems on a startup timer of 5 minutes or so to ensure the power was stable before booting (that includes UPS inputs, I've seen a couple fried by dirty power when it was restored)
Whilst you can do this using bios delay timers it's not ideal in a lot of cases (drives don't like being spun up/down repeatedly) and there are smart distribution panel controllers around these days which take it a few steps further, with things like a selectable startup delay coupled with longer lockouts if they detect several power failures in a row.
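The logic in those controllers is simple enough to sketch. This is only an illustration of the idea: the 5 minute base delay is the figure from above, while the one hour window, the three-failure threshold, the 30 minute lockout and the function name `startup_delay` are all assumptions of mine, not any vendor's settings.

```python
def startup_delay(failure_times, now, base_delay=300, window=3600,
                  max_failures=3, lockout=1800):
    """Return how many seconds to wait after power returns before powering up.

    failure_times: timestamps (seconds) of recent power failures.
    now: current timestamp (seconds).
    All default values are illustrative assumptions.
    """
    # Only count failures that happened within the recent window.
    recent = [t for t in failure_times if now - t <= window]
    if len(recent) >= max_failures:
        # Power is clearly flapping: hold off for the longer lockout.
        return lockout
    # Otherwise just wait long enough to see that the power is stable.
    return base_delay


if __name__ == "__main__":
    print(startup_delay([0], now=10))             # one failure: base delay
    print(startup_delay([0, 120, 240], now=250))  # repeated failures: lockout
```

With a supply that comes back for 1 minute 45 and drops again, the third failure trips the lockout instead of letting the drives spin up and down all weekend.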
-
-
-
Monday 5th March 2018 11:01 GMT Bob Wheeler
Repetitive work on multiple servers
I was working on a 16-node Novell Cluster, updating drivers. A process that had been done many times and non invasive and with no loss of service so deemed by management as safe to do in working hours.
The process was simple, take a node out of the cluster - “CLUSTER LEAVE”, copy the new device drivers and then reboot that node - “SERVER DOWN”, wait for it to start up and re-join the cluster, and move onto the next node.
By about the 14th or 15th node, after typing the same commands time after time, instead of typing “CLUSTER LEAVE” to take the node out of the cluster, I typed “CLUSTER DOWN”.
It should be noted that Novell does NOT ask “Are you sure?” when you type such a command, and it does what the command suggests it does – instantly. All users, potentially some 4,500 of them suddenly lost their file shares, email, printing, internet access – the works.
My only saving grace was it was late afternoon on a Friday so there was not that many users actually affected.
-
Wednesday 7th March 2018 00:17 GMT Alan Brown
Re: Repetitive work on multiple servers
"All users, potentially some 4,500 of them suddenly lost their file shares, email, printing, internet access – the works. My only saving grace was it was late afternoon on a Friday so there was not that many users actually affected."
We have a policy of warning users when work is happening. They're a lot more forgiving if they've been given a heads-up
-
Monday 5th March 2018 11:25 GMT JeffyPoooh
What about Power Failures?
"UPS" you scream.
No, I'm referring to the power failure caused by the UPS catching fire, ...again.
A well designed database would have journaling at the transaction layer, and more journaling again at the FS level. Oh, sorry. SAP.
My buddy runs the IT for a company. He tells me that the server can have its power cord yanked out, and the backup server in his basement at home will complete the transactions, transparent to the users. They run in parallel and he's done something clever at the networking level.
-
-
Monday 5th March 2018 14:07 GMT Anonymous Coward
Re: What about Power Failures?
cluster when you destroy one server ... and the system keeps going
HP Non-Stop. Check out the price, then come back after you've recovered.
BTW us in telecoms have had active / active standby for a very long time, its how we roll.
Upgrades? No problem, upgrade "non-live", flip, upgrade old live.
100% of calls and systems still live.
-
Wednesday 7th March 2018 01:22 GMT Alan Brown
Re: What about Power Failures?
"BTW us in telecoms have had active / active standby for a very long time, it's how we roll"
Which works really well, until it doesn't.
At which point you may discover that whilst the running systems were ok, what's in the configuration (and has been backed up to tape for the last 2 years) is scrambled. So if you reboot one controller after the other when applying your y2k fixes, you find your NEAX-61E has forgotten that it's a telephone exchange - and that after spending 2 days finding a working backup (3 years old), you then have to replay every update made from that point - which takes 6 weeks - and means that a large number of your customers can't be sure from day to day what their phone number might be - or even if they'll have dialtone.
Yes, it happened.
-
-
-
Monday 5th March 2018 11:38 GMT OzBob
Came close myself just today
What bright spark decided to allow keypresses on vSphere Client to perform menu actions? So if you don't properly focus on the console, you can type away and get prompted with "do you want to shutdown"? Fortunately I looked up and saw that before I got too far, but it was close.
-
Monday 5th March 2018 11:47 GMT ysgubor anhysbys
database reboot
Our sys admin was doing some maintenance on a replicated database. He had stopped the slave, made the necessary changes and then hit the power button to do a hard reset... unfortunately, the power button belonged to a different server - the live database master. Somehow we got lucky and our 3TB of data survived.
-
Monday 5th March 2018 12:07 GMT Anonymous Coward
Probably my fault for being unclear
I used to manage "the UK's Most Dubious Beowulf Cluster", 80-some Pentium 4s running a scheduling job on each of them that waited for a text file to tell them what simulations to run. Not the world's most brilliant solution (especially since they used a regular user's account), but it worked well enough.
One day, I was having trouble with my email, probably because Outlook Exchange was a delight back in the day, and our Scottish helpdesk were very helpful, doing all the things they needed to do to fix it until, without warning, they said "Right, your new password is...".
While I was logged in to 80-odd Pentium 4s that suddenly had outdated credentials and thus, no LAN access. Cue me and a room full of KVM switches, re-logging dozens of machines and restarting failed simulations.
On the plus side they did fix my email.
-
Monday 5th March 2018 12:25 GMT HPCJohn
Re: Probably my fault for being unclear
Talking about Beowulf clusters.... Several years ago I was at a customer site in a big UK company which may or may not build jet engines.
Stood at the console of said machine, I wanted to reboot one of the servers in the cluster. I was telnetted into one of the servers in the cluster and wanted to reboot it. I go ahead and press the Vulcan Death Grip - ctrl-alt-del. Only the whole shooting match went down, not the server I was logged into. Cue red face from me. But they were very good about it.
-
Monday 5th March 2018 12:29 GMT Greg Stovall
Silence is NOT golden...
Back in the 80s, I was on a co-op term at a major telecommunications manufacturer. My assignment for the summer was to port a wire wrapping program from a DG Eclipse to an HP 3000. It was a very enjoyable exercise writing a converter from RATFOR to Fortran 77.
The factory floor was quite a noisy place with all the manufacturing equipment. Since I was new to the HP 3000, I spent a little time exploring. Discovered that as administrator, I could actually poke any memory location directly. I experimented with this...then noticed it was quiet --- too quiet. Panic filled my soul when I realized that the HP3000 I was poking on was the same one that ran all the manufacturing equipment -- and I had crashed it in the middle of the work day.
I learned NOT TO POKE memory on the HP 3000...
-
Monday 5th March 2018 12:42 GMT Anonymous Custard
The hardware version...
I take your server shutdowns and offer you a colleague doing it on a semiconductor manufacturing machine (of course in the middle of running 150 production wafers). Needed to power down machine A in a bank of them to work on it, so goes around the back and accidentally hits the power button on machine B beside it. Bye-bye 150 product wafers towards the end of their production flow, in all worth many thousands of dollars.
We are now strictly verboten from even touching any machine which doesn't have clear ID labelling (customer responsibility to add those, the ones above didn't) and even then we have to point and say plus buddy-check. This is not to say that it hasn't happened since these measures were introduced of course, given some of my colleagues and the old adage about idiot-proofing...
-
Monday 5th March 2018 16:32 GMT Anonymous Coward
Re: The hardware version...
I know that feeling...
I work for the supplier to a semicon lithography systems maker. They use 4-digit (hex) machine identifiers. They're not sequential but can be quite similar, and a typo is easy to make when remotely accessing a system. I may or may not have shut down the wrong system for service at some point... Luckily this was at the manufacturer's fab and not a field system though. Working on field systems always makes me nervous given the dollar amounts involved.
From experience it's also not easy to explain to a customer's line manager at 9pm that you broke his system some more instead of fixing it like you were supposed to.
-
-
-
Monday 5th March 2018 14:14 GMT Anonymous Coward
And me!
DEC Alpha workstation, while visiting a physics department in Oregon in the early 90s. Having finished up late while running some simulations, I confidently reach down to the power bar and remove the power brick for my portable CD player .... and the workstation suddenly goes off.
I'd also nudged the adjacent switch at the same time. Oops.
-
-
Monday 5th March 2018 13:06 GMT Rufus McDufus
Emergency power off
First job working in the comp sci department at a well-known technology-focused university in London. Annually we'd show prospective students around the facilities including the server rooms. There were big red emergency power-off buttons in various places. A particularly tall budding student decides to lean back against the wall and... These were the days of IBM 4331s, various DEC servers, a big ICL mainframe and others. Generally things didn't tend to work well after a sudden power-off.
-
Monday 5th March 2018 13:27 GMT Anonymous Coward
Toggle power switch
This story is strikingly similar to an anecdote from colleagues at a previous job, when one of the ops guys went to power off a server and was informed as he pressed the switch that it was the wrong machine. Although in roughly the same time period, I'm certain it's not the same incident because our site never ran SAP.
The box in question was running end-of-day batch processing, so could not be allowed to power off otherwise carnage would be caused.
Unfortunately the recessed nature of the switch meant that nothing could be jammed in to replace his finger without also releasing the button at the same time - so he was forced to stand there in the comms room for the next two hours or so, with the end of his finger going blue, waiting for the batches to finish so that the machine could be gracefully shut down.
In a separate incident, another colleague at the same site had apparently stepped into a hole in the floor of the comms room where a tile had been removed ('elfin safety??) - reached out instinctively to stop his fall, but found he'd hit the emergency power-off button on the side of the AS/400 ... oops!
-
Monday 5th March 2018 13:36 GMT AndersBreiner
The Big Red Button
I was once working on a web application, back in the .com boom. We had a production server which was heinously unstable. We'd test on our dev server for a week and then send stuff over to the production one. The production one was administered by another company and we'd call them and tell them how to do stuff. Either this was before the days of VPN or they didn't want to allow that, for reasons that will become clear.
Anyhow I was in an interminable call with them.
"Ok, you've got the files unzipped"
"Yep"
"Right click on the .reg file and add it to the registry"
"It crashed"
"What do you mean crashed? Did you add it?"
"No it crashed when I right clicked"
"How did it crash?"
"It said explorer.exe performed an illegal operation"
"Well that's odd, isn't it. Try to open up this folder"
"It crashed again"
"Ok press the Windows keys and R and type"
"Crashed again"
"Let's try a restart"
At this point I hear a loud clunk, then a pause, then another loud clunk
"What was that?"
"We're restarting"
"Don't you do that through the start menu?"
"No, it always hangs when we do that, so we just use the big red button"
And then I worked out why nothing new ever worked in production and why things that used to work stopped - the server was so addled at this point that it couldn't reboot without someone power cycling it. It'd probably gone through hundreds or thousands of hard power cycles. This was NT so it was somewhat robust but you did lose data on a hard crash - any files that were open for writing would be corrupted and sooner or later you corrupted something vital.
-
Monday 5th March 2018 13:45 GMT Joseph Haig
Going Dutch?
Is this what the Little Dutch Boy is doing now?
-
Monday 5th March 2018 13:53 GMT wallyhall
I remember doing that once
It wasn't on a server though - just when we were kids at secondary school. Guy sat next to me thought it'd be funny to "hold my work to ransom" by pressing and holding the power button on my PC. I quickly pressed my finger onto the button next to his, and discovered that if you release and press it again *really* quickly, the charge in the PSU survives the very short outage without the PC turning off. :-)
-
Monday 5th March 2018 13:55 GMT EddieD
Me, that's who...
Many years ago I went to the small server room we had (this is the late 90s), and there was a KVM that connected to all the machines.
I pressed the relevant button for the NT server I administered, and, as I always did, hit <ctrl><alt><del> to give me a login prompt, as the monitor was waking up. Unfortunately, the KVM had been "rationalised" by my boss, who hadn't updated the switch labels, and I'd just connected to a Linux server console session, which immediately shut down.
The team that were using it for data analysis were a tad miffed.
-
Monday 5th March 2018 14:17 GMT Jay 2
My first proper job, I was on the console of a test server (running UNIX) which I needed to reboot. So I typed in the shutdown command, pressed return and then wondered why instead of seeing the running commentary of the server shutting its services down I was faced with a disconnected telnet session message and the prompt of the test server...
...I then realised one of my colleagues had rather stupidly logged into the prod server on the test server's console. So I'd just rebooted the prod server, as stupidly for me I hadn't checked which server I was typing the command on. I got away with it as we were a bit of a law unto ourselves; it was a stunningly good education in how not to run a data centre. Once I moved elsewhere I realised things were very different!
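For what it's worth, the "check which box you're on" step can be automated. Below is a minimal sketch in Python, assuming a hypothetical wrapper that makes the operator type the hostname of the machine they think they are on before anything destructive runs (the Debian molly-guard package does something along these lines for shutdown commands over SSH). The function name and the example hostnames are made up for illustration.

```python
import socket


def is_intended_host(typed_name, actual_name=None):
    """Return True only if the operator typed the name of the machine
    this code is actually running on (short name, case-insensitive)."""
    if actual_name is None:
        # Ask the OS which machine we are really on.
        actual_name = socket.gethostname()
    # Compare short names so "prod01" matches "prod01.example.com".
    actual_short = actual_name.split(".")[0].lower()
    typed_short = typed_name.strip().split(".")[0].lower()
    # An empty answer never counts as confirmation.
    return typed_short != "" and typed_short == actual_short
```

A wrapper script would prompt for the name, call `is_intended_host()` with the answer, and only run the real shutdown command if it returns True. Typing "test01" while sitting in a session on the prod box would then be refused rather than obeyed.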
-
-
Monday 5th March 2018 16:23 GMT I ain't Spartacus
I suspect they were foolish enough to be honest. "Person has made error is holding button we must reboot." So people act slowly. And argue.
If they'd said, [clickety excuse-o-matic clickety] "NASA has reported an incoming solar flare - we expect to lose computer performance in 8 minutes emergency shutdown and reboot to solar-wind hardened crisis mode." Maybe they'd have had more luck.
Or they could have just said they had to reboot to reverse the polarity of the neutron flux...
-
Tuesday 6th March 2018 10:04 GMT Anonymous Coward
"Jeremy" here.
To be honest: I have no clue, what our service manager told our customer.
I just recall me running out of the DC and through the building to alarm our team lead and the service manager.
No arguing, within seconds people dropped everything they were working on and started sending out emergency alerts. It even went out via the PA!
(That only happened once after that: We got the "all clear" after a bomb threat, due to an American VIP visiting us to present her biography)
I tend to think the arguing and discussing futilities started a few years later, where everybody started "managing" instead of working.
-
-
Monday 5th March 2018 15:49 GMT Anonymous Coward
Wrong server
Back in the days of IBM SP/2s, each shelf in the frame could take one 'wide' node, or two 'thin' nodes.
The numbering of the nodes was such that if only wide nodes were installed, only the odd-numbered nodes (1, 3, 5 etc) were present. You only had even nodes if there were thin nodes present.
My team leader at the time remotely shut down the OS on node 3, and then went to physically power off the node. Starting at the bottom, he counted up the shelves one, two, three, click... and powered off node 5 (because of the missed numbers, node 3 was on the second shelf up not the third).
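The off-by-a-shelf arithmetic is easy to get wrong by eye, so here it is spelled out. This is only a sketch of the numbering as described above, assuming each shelf reserves two consecutive node numbers (2k-1 and 2k for shelf k, counted from the bottom) and that a wide node takes the odd one; the function names are invented for illustration.

```python
def shelf_for_node(node_number):
    """Shelf (counted from the bottom, starting at 1) that holds a node,
    assuming every shelf reserves two consecutive node numbers."""
    if node_number < 1:
        raise ValueError("node numbers start at 1")
    return (node_number + 1) // 2


def wide_node_on_shelf(shelf):
    """Node number of the wide node on a given shelf (always odd)."""
    if shelf < 1:
        raise ValueError("shelves are counted from 1")
    return 2 * shelf - 1
```

Under that assumption `shelf_for_node(3)` gives shelf 2, while counting up to the third shelf lands on `wide_node_on_shelf(3)`, which is node 5.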
As this was a commercial bank, he had turned off the main trading server for one of the main trading rooms. He kept his job, because at the time the applications occasionally took the systems out due to paging issues, so he passed it off as an instance of that.
Anon, to protect the guilty.
-
Wednesday 7th March 2018 01:32 GMT Alan Brown
Re: Wrong server
"Starting at the bottom, he counted up the shelves one, two, three, click... and powered off node 5 "
THIS is why I insist on labelling Front and _REAR_ of systems along with their power and network cables (at both ends).
Seriously, if you think you (or anyone else) might be stumbling around an unfamiliar rack, then it's worth spending the time to make sure everything's labelled.
I'd rather have the place looking like the 1960s Batcave than have people not knowing what they just reset.
-
-
Monday 5th March 2018 16:14 GMT Anonymous Coward
Many moons ago a colleague phoned a small Scottish school to diagnose a problem with their dial-up modem link. Nothing obvious, so he got the woman to try a 3-pin reset... "turn the power off, count 5, then turn it back on again". She put the phone down while he waited on the phone. He suddenly went white as a sheet and swore as he heard in the distance "well the man told me to do it"... yes folks, she had gone to the consumer unit and turned off the power to the whole building... luckily it was a one-room school house and only the lights and a single PC were affected (PC just rebooted)
-
Monday 5th March 2018 18:28 GMT 2Nick3
You can do this with NIC settings as well
Had a customer complaining that the admin (and only user-reachable) interface NIC in our appliance had autonegotiated at 100Mb, where it was on a 1Gb switch. He was making a big fuss about it, getting the account team involved (they were negotiating another purchase), being a bit of a pompous arse overall. I told him if it had dropped the speed there was likely a real reason for it (the call-homes showed it changed outside of a reboot), and we should look at the logs on the system, and probably the switch, to figure that out before he did anything.
Like hard-code the speed to 1Gb. Which he did.
I had to send a tech onsite to console into the box to reset the NIC to autonegotiate. The next day the port on the switch was replaced and the interface went back to 1Gb.
-
Monday 5th March 2018 18:37 GMT vincent himpe
Cleanroom suits and power breakers...
Picture this. A cleanroom where integrated circuits are made. A massive multimillion dollar ion implanter. High voltage, deep vacuum, ion beams, cryopumps. Magnet power supplies feeding 3000 amperes...
All hanging off a three phase lever switch mounted on the wall. One of those big 'clunk' type rotary levers that are gas-spring operated to shoot the contacts open.
Plant and facilities is called for a small water leak in the service area. The tech goes in and looks at the leak and gets ready to put a small pan underneath while he goes out to get a new piece of teflon tubing to replace it. Before crouching down he adjusts his cleanroom bunny suit (those are uncomfortable if you have to bend over or kneel down). While doing so his belt snags on the big power breaker handle.
As he kneels down he feels the snag but it is too late. Ka-lunk : the whole machine goes dark
Vacuum isolation valves lose control pressure and pop open. The 6 meter long beamline sucks in air, pulverizing the poor wafer that sat in the interlock. Ion gauges blow their filaments, exposed to the inrushing air. The cryopumps lose vacuum and immediately freeze over, shattering the traps.
The tech, scared witless by all the banging and clonking, turns around and does the unthinkable...
He grabs the big lever of the switch. and re-engages power to the machine...
...
It took 2 months to overhaul the machine and get it back up and running.
-
Monday 5th March 2018 19:28 GMT Anonymous Coward
Big cooling tower cluster burns down, hundreds of servers shut down by hand
At a TOP2000 company in 1999. Three on-site datacenters (one main, two backup) were powered by a big on-site oil power plant, and additionally connected to the power grid with two (redundant) power lines. The data centers had a cooling tower cluster building nearby. The cooling tower ran out of water and the rotating parts in it caused the towers to catch fire. To aid the on-site firefighters, the internal power plant had to be shut down. They basically shut off all electricity on site, but completely forgot about the data centers. The data centers automatically fell back to battery power that would last for 15 to 20 minutes. The whole complex got evacuated, but the admins refused to obey the order and stayed in to shut down hundreds of servers one by one, to prevent serious problems with SAP R/2, SAP R/3 clusters and Oracle databases. The cooling towers building burned down to the ground, and with it many cars on the car park nearby.
Unfortunately, they learned little from the incident. They rebuilt a carbon copy of the cooling tower cluster building, and the admins kept the non-automatic monkey patching method. 15 years on, almost the same incident happened again. This time the power lines to the grid got overloaded, the lines burned through, the power plant turned off automatically, and the data centers switched to battery. The on-site telephone system was now Cisco IP phones, and had no backup battery. So no phones. The cell phone tower was on-site, powered by the same power line, so no cell phone coverage either. So the only communication method was a few analog walkie-talkies belonging to the firefighters. Admins had to shut down servers by hand one by one again, this time in even more of a hurry, as the backup batteries would only last for 15 minutes and lots more servers and virtual servers had been added in the meantime. Obviously, this time only some servers were gracefully shut down, including what admins thought was important (SAP), but not Oracle, Microsoft, Cisco, etc. Oops.
Have they changed a thing? Probably not.
-
Monday 5th March 2018 21:01 GMT Anonymous Coward
Negotiation tactic
Had a similar event back in the late 90s. Worked for a small computer/networking shop that handled IT for larger companies. One had a fleet of servers (mostly Novell, but a few SCO and other odds and ends). Their "rack" was a large multi-shelf workstation with multiple KVMs that was loaded up with tower style PC cases for their servers. This rack had grown over the years and developed a massive ratsnest of power cables, phone cables, CAT5 and even a smattering of Twinax at the back.
We decided to pull an all-nighter to re-cable the entire rack. Redo network wiring with better lengths, tidy everything up, label all the wires, etc. Around 3AM I was in the back of the rack and I heard an "oh shit" followed by a long pause at the front, followed by a request to come over to that side of the rack. My boss was standing there with a finger on the (AT style) power button of a rather critical Novell server.
We were working out of production/business hours, so downtime was fine, but some of those machines didn't take kindly to losing power abruptly. Our process up to that point had been to down the server, then power down. He was doing that, but, being a bit punchy at that time of day, pressed the button on the wrong server. Worse, the KVM was out of his reach.
He asked me to down the server for him... I hinted that I might need to go get a coffee and take a break first. If I were smarter, I would have renegotiated my hourly wage at that point instead!
Anon, cause we got through it (among other incidents) without the customer suffering at all. No harm, no foul.
-
Tuesday 6th March 2018 03:26 GMT The Oncoming Scorn
On a (Happier) Contract A Few Years Ago
Already told the story of....
the offshore support insisted on the former plant Sysadmin hitting the plant's BRB. Pictures were sent to the remote guy via email; he confirmed that was the button he wanted pushed & goodness gracious me it was going to be pushed. He was advised again of what it was that would be pushed & the consequences, and the plant manager was dutifully informed of what was required, what the offshore wanted & what the fallout would be. & so it came to pass that the BRB was pushed on the word of the Technically Competent Support representative. (Paraphrasing here...) "Goodness gracious me, why is your plant disappearing from network?"
The other was the building shutdown for the energy traders (same company just in downtown Calgary). I was advised from on-high that there was no need for me to be on site to do anything (Mistake number one), that anything that was needed like putting the very very large UPS into "Passthrough" before they did a controlled shutdown could be done (but wasn't) by the site contact.
3am messages from the Sysadmin team in India left on my desk phone at my normal work location didn't get to me until Monday morning, for some strange reason of me not sleeping at my desk on a Saturday night (as opposed to the normal Monday - Friday mid-afternoon stupor (no booze involved)).
On arrival at the site Sunday morning & ringing into the bridge as the sound of silence from the servers was deafening, the building power was on, but the surge had tripped the UPS breakers & each battery had to be checked by the third party techs, before we got the go ahead to bring up the UPS & then bring everything else back up.
Still Sunday double time my minimum 4 hours turned into 6 hours or so which wasn't too bad.
-
Tuesday 6th March 2018 09:04 GMT Black Betty
RECOVER C:\*.*
Having zero knowledge of FAT disk structure, I managed to use a copy of Norton Disk Doctor to find the directory entries on the disk and partially reverse engineer how files were stored. Enough to find the lost sub-directories but not properly suss out how deleted files were actually marked.
So I rebuilt the root directory by hand, and hand patched rest of the disk back onto it the hard way. Come the Monday this APPLE ][ fondler lashed out on a copy of Understanding MS-DOS and discovered just how much easier recovery could have been. But by then the machine was already back in service.
-
Tuesday 6th March 2018 09:20 GMT Murray C
Booby-trapped Cisco
Can remember a few near misses with remote work on Cisco branch office routers - if we were changing say, the crypto-maps on the WAN interface, we would typically do a 'reload in 10' command at the console to reload the router with the last saved config in 10 mins time if we somehow managed to bork the change & lose the link.
...just don't get distracted & forget the 'reload cancel' command afterwards if all goes well!
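The same "arm the rollback first, cancel it only once you know you still have the link" idea works outside IOS too. Here is a rough Python sketch of the pattern, not of any Cisco tooling: `apply_with_rollback` and its arguments are hypothetical names, with the timer standing in for 'reload in 10' and the cancel standing in for 'reload cancel'.

```python
import threading


def apply_with_rollback(apply, verify, rollback, delay_s):
    """Arm a rollback timer, apply a risky change, and cancel the timer
    only if verification succeeds. Returns True if the change was kept."""
    timer = threading.Timer(delay_s, rollback)
    timer.start()  # armed before touching anything, like 'reload in 10'
    try:
        apply()
        ok = bool(verify())
    except Exception:
        # A failed change or failed check counts as "not verified".
        ok = False
    if ok:
        timer.cancel()  # the step that is easy to forget: 'reload cancel'
    else:
        timer.join()  # wait here until the rollback has actually run
    return ok
```

Putting the cancel inside the helper is the point: it happens automatically once `verify()` passes, so there is nothing left to forget when someone wanders over to chat halfway through the change.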
-
Tuesday 6th March 2018 09:22 GMT nwillc
Back in the day
I was giving a new admin a tour of our "state of the art" machine room. There was a red button about the size of a cantaloupe marked only FPO. They asked what that was, and I jokingly said "I don't know, hit it." They did.... and the room... yes the entire room proceeded to "Full Power Off". Even the UPSes... they asked "What do we do now?" I could only offer "run".
-
Tuesday 6th March 2018 11:55 GMT Anonymous Coward
sysprep wrong server
My biggest cock-up was when I was remotely logged into a hypervisor and setting up a virtual machine, but I was inadvertently running the commands against the hypervisor instead of the VM, and took the entire hypervisor and all the VMs offline as a result. The worst thing was it was a SYSPREP command I was running.
-
Tuesday 6th March 2018 15:34 GMT Goit
My second day working as a Systems Admin for a large law firm back in mid 2000's, asked to take out a couple of servers that had been decommissioned and free up the rack space for a SAN that was going in.
4 power cables, going into two servers, walked around the back of the rack and yanked all the cables out. Except... I had gone to the wrong rack and yanked the cables out of both of the Exchange cluster servers, bringing down the mail system for about an hour >.<
It was so tidy and symmetrical it looked identical to the rack I was supposed to take the servers out of... Surprisingly they kept me on for a third day :D
-
Tuesday 6th March 2018 17:54 GMT Anonymous Coward
Workstation Power buttons
Workstations -- real workstations, that is -- typically had power buttons on the front. Press it and the box gracefully shuts down. I did this by accident once. No damage, of course, but a lot of lost time. Interestingly, the later releases of the OSes put up a box asking if you really wanted to shut down.
-
-
Wednesday 7th March 2018 10:50 GMT HPCJohn
Re: Label your servers!
I totally agree.
Just flagging up - Dymo labels from office type label guns are useless and will dry up and flake off.
You need proper labels on cables and servers.
Server manufacturers - listen up. Put a transparent plastic fixture on the front. 1cm x 5cm
The user can slip a printed label behind this.
The original Sun-designed pizza boxes had lights out controllers which had a dot matrix display.
Designed so you could display the server name on them.
I installed racks of them at Nottingham Uni. I never did have the guts to print out rude words on the displays.
-
Sunday 11th March 2018 19:07 GMT Alan Brown
Re: Label your servers!
"Dymo labels from office type label guns are useless and will dry up and flake off"
Laminated type labels work well, stick like shit to a blanket and stay put for decades. There are even anti-tamper types, ultra sticky ones for hot areas and ultraflexible types for putting on cables.
At an installed cost of around 6p each they're cheap insurance.
Brother and Dymo both make them (I prefer the Brothers) and there are a few others floating around.
-