Alternative Lesson: "Never turn anything off if..."
"you don't know how to turn it off"
FTFY
Welcome once again, valued reader, to Who, Me? – The Register's weekly confessional column in which readers recount their tales of derring-do that derring-didn't. This week meet a reader we'll Regomize as "Doug" who found himself in something of a hole when, as an ops manager for a food supply company, he oversaw a massive …
Time taken for a senior manager to formulate and execute plan for an impromptu DC failover using the Big Red Button? Microseconds.
Time taken to get everything back up and working, failed boards replaced, systems restarted, rogue LAN cables replaced, comms balanced? 39 hours.
Size of boot applied to arse of senior manager by Very Senior Manager? <-------------------- This Big -------------------->
However, if a power failure needs serious effort or hardware fixes to get it going, it's a shit system.
Who here has not seen a UPS fail and simply take out the supply instead of going to bypass? (Think of any unfortunate APC owners.)
Or, less commonly, a local digger has JCB'd the local 11kV feed, your power is off for hours and the UPS is exhausted? (Generators are available; some of them might even work when needed.)
... if a power failure needs serious effort or hardware fixes to get it going, it's a shit system.
Exactly my thought when I was reading the article.
And if it isn't in the manual it ought not to happen.
Well ...
Reality, fate, chance or serendipity do not look at manuals for cues as to how to proceed.
Shit just happens.
20+ years ago I was involved in the set up of all the hardware at a local election vote-tally data entry centre.
It was a 300 PC, 2X heavyweight mirrored Compaq servers, industrial size UPS + automatic start generator gig.
The very short deadline and the usual opposition rabble with daily accusations of vote tampering did not make things easier, as the team were in the papers every day.
Not a nice scenario and as it was my first job after a very long dry spell, I was not going to let anything pass.
When everything was set up and all the individual parts of the system were duly tested and approved, a dry run was scheduled.
With the 300+ staff working on mock/simulation vote tallies, and in the midst of it all, just to see their faces, I asked them all what would happen if the power went out at any moment.
A flurry of explanations was given, all very correctly pointing out what should (according to the manuals) happen in such an event.
I replied that the proof of the pudding is in the eating: it had to be tested without notice to anyone.
Protests ensued but I refused to budge from there, clearly stating I would not sign off on anything if not tested as I required.
Not at all happy, all parties involved finally accepted and all went as expected.
That's the only way to know if things work properly.
O.
All happily sitting in our office on a sunny Tuesday morning when suddenly there's a huge bang outside, the whole building goes dark, screens off, beeping everywhere as the building UPS kicks in to save the comms room. It's the bomb, run for the shelters! Nope. One of the council's finest digger operators had decided he wanted to cause untold chaos up the whole street by cutting through a stack of cables feeding several businesses. Cue 60 minutes of dusting off DR plans, switching off non-essential kit and praying the UPS holds; 4 hours later the leccy board hooked everything back up temporarily while they decided how to fix the mess properly. It took about 2 weeks of constant minor power cuts, which resulted in a few PCs going bang and countless insurance claims from the businesses in the area as old kit finally gave up the ghost after too many power blips.
None of this was in any manual we had, so we wrote our own from the experience!
@Admiral Grace Hopper
Seeing as you currently have 42 upvotes, and I is 'vogon-ish', I have to reply to this.
Size of boot applied to arse of senior manager by Very Senior Manager? <-------------------- This Big -------------------->
Would that it were so... in my experience the shit flows from VSM->SM->M->me, even if it should have stopped at SM or M level :-)
"in my experience the shit flows from VSM->SM->M->me, even if it should have stopped at SM or M level:-)"
Indeed, and the amount of excrement gets deeper with each descending level, so that the VSM-->SM "Don't do that again, Clive, please" ends up as M-->grunt "you're lucky you weren't fired".
Turning it back on usually is the problem
Especially if, like some of our more 'legacy'[1] systems, turning the system back on isn't just a case of turning all the servers on... no - it's a carefully-scripted set of actions (turn on server A, enable services in a specific sequence with specific timings, turn on server B, wait 5 minutes, turn on server C, etc.).
[1] Short for "we'd like to take them out the back for a mercy-killing but the business won't let us"
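A sequence like that is begging to live in a script rather than a runbook. Here's a minimal sketch of the idea; all host names, service names and timings below are invented for illustration, not taken from any real estate:

```python
import subprocess
import time

# Hypothetical startup order for a legacy estate: each entry is
# (host, services to start in order, seconds to settle before the next host).
STARTUP_PLAN = [
    ("server-a", ["database", "message-queue"], 0),
    ("server-b", ["app-tier"], 300),      # "wait 5 minutes" before server C
    ("server-c", ["web-frontend"], 0),
]

def flatten_plan(plan):
    """Expand a plan into the exact ordered (host, service) start actions."""
    return [(host, svc) for host, services, _pause in plan for svc in services]

def cold_boot(plan, run=subprocess.run, sleep=time.sleep):
    """Walk the plan, starting each service over SSH in the mandated order.

    `run` and `sleep` are injectable so the sequencing can be tested
    without touching real servers.
    """
    for host, services, pause in plan:
        for service in services:
            run(["ssh", host, "sudo", "systemctl", "start", service], check=True)
        sleep(pause)  # honour the settling time before the next host
```

A plan file like this doubles as documentation of the bring-up order for the next poor soul who has to do it at 3am.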
And more often the zeroth step - work out exactly what legacy kit you really have, and where the **** it's been hidden.
Having brought up servers A, B and C, you find out that they rely on mysterious servers AA and AAA for various services and data feeds - servers that no one knows anything about, as they've been quietly humming away in the back of a cupboard/under a desk/behind a partition wall/in some dusty basement for years without anyone actually being aware of them.
And of course, by the laws of Sod, those are the ones that are the key foundation for everything and haven't been touched for years, so they get mugged by the dust bunnies that have built up inside their cases, or their fans/PSUs just screech to a halt or emit the magic smoke...
Back in the day, Novell updated the network protocols to improve performance on large networks.
It was necessary to roll out the changed stack to all servers in a network before flipping the setting to use the new protocol; if any old-style servers were still powered up they would be used as the gateway by default.
Sure enough, despite weeks of audits, server upgrades and comparison of real devices found against asset registers, when we made the change the network stopped.
We were expecting this and were perched looking at network traffic, and soon found which network segment was taking all the traffic; phones were ringing off the hook and the entire organisation was stopped. We descended on the affected office and sure enough there was no server visible. A search of all cupboards revealed a hole with a LAN cable and power cable in the side of one; sure enough, under a pile of other rubbish, there was a 286 server lying on the floor inside the cupboard. It must have been at least 8 years old and had survived with very little ventilation and numerous power cycles with no attention for at least 4 years.
When we got it back to our office the only service it was running was a fax gateway, at some point it had been decided that it was unsightly and someone had paid a joiner to make the hole in the cabinet then shoved the server in there and forgotten about it.
Yep -- frightening how often this happens. I was working in a telephone office when a *lot* of alarm bells started sounding. Outside, two lines of pin-flags delineated the path of the cables between office and world. A digger decided to put his bucket right between the two lines. It took some time to splice them all (working round the clock). This happened decades ago; the manager at first thought that the system was being overloaded and started shouting: "Has Nixon resigned?"
Tentatively, I am coming to believe the way to avoid this is to, at the implementation stage, have the system's developers follow a process that involves switching it on and off a lot, and tearing it down and re-deploying it a lot. Ideally with all the components booting up in a more or less random order.
So the processes for deploying and cold booting your system will all get thoroughly tested many times and the developers themselves will be very severely inconvenienced if it doesn't reliably come up cleanly without hand holding. Hopefully they'll respond to this by fixing the software to boot reliably (e.g. making each part come up cleanly when its dependencies eventually show up).
Was absolutely terrified of this today: small 19" cabinet in a nursing home, doing the internet routing for the building, CCTV, and medical records for residents, much of it mandatory for their insurance/licence; said cabinet contained an ancient-looking APC UPS. Our job (should we choose to accept it) was to feed more sockets off the circuit it was plugged into, and thus we needed to kill the power to that circuit. There was another socket on the other side of the office, on a different circuit (different distribution board and different electrical intake even, lol). I had the nerve-wracking job of yanking the plug out and slamming it into an extension lead from the opposite socket. Imagine the HORROR when the (borrowed) extension lead arced and crackled while I wrestled to put the plug in. Fortunately the ageing UPS not only had a functioning battery, but withstood the barrage of arcing impeccably.
They are, however, going to schedule a shutdown to allow us to replace the 12V batteries, as it went down to 30% in those few seconds.
""One side of mirror switched off; mirror does its job" ... and that's it?"
Just a couple of years ago, we were moving an AS400 that was, in everyone's memory (and not on documentation), configured, $DEITY knows how, as a mirrored active/passive AS400 metro-cluster.
Upon switching the passive side on, we realized the active side had switched everything to Read-Only.
We never had time to investigate this, but moved the passive AS400 as quickly as possible.
But then again, this legacy system was due to be decommissioned. It had been in that state for multiple decades :)
Sometimes the manual omits / hides useful information.
I remember working on an IBM 1130 system at JPL (yes, it was that long ago; punched cards to load programs, big panel of blinking lights and toggle switches instead of a monitor, 8k of RAM…).
The system flat out refused to run Assembly and Fortran at the same time.
An enterprising system engineer I worked with discovered that hiding in the config byte-array at the head of RAM was a bit that enforced this. Changing that bit allowed Assembly and Fortran to run and call each other.
We couldn’t be bothered to check if IBM offered an upgrade to allow both languages to run at the same time.
Geezer icon —->
I used to be an IBMer and had a sign in my cube that said "Nunquam Permissium Opus, Interfere Per Plactium"... Internet Latin for "Never Let the Work Interfere With the Meetings"
I figure if you're going to be a smarta$$ at work, it's best to do it in Latin. It's classier and no one knows what it means anyway (though I'd tell anyone who would listen).
When the manuals were referred to as "eyes only", my immediate thought was "brain not allowed". Which was then duly carried out by the IBMer Bob - "no, we can't do a test that simulates real-life (including startup), we MUST do it exactly the way the book says!"
I've told this one before.
A bank was hit with a power outage, and they went into the well-practised failover to the backup site. A passing senior manager/director told them: don't do that - we have generators in the car park for this sort of emergency - use those and avoid the outage of switching sites. So they reluctantly restarted in place. Halfway through the power-on and restart, they found the generators did not have enough capacity for the machine room, and so they were stuck in limbo. They did not want to shut down halfway through an emergency restart, and they could not complete the start-up in order to be able to shut it down. They had to wait a couple of hours until the power was restored, after which they could complete the restart.
When the incident was reviewed by the board, the manager/director had to explain that it was his decision, and admit he did not actually know the generators' capacity - he had just paid for them.
The IT team learned a lesson - there are times when you ignore the management chain and do what you have practised.
The IT team learned a lesson - there are times when you ignore the management chain and do what you have practised.
That only works in situations where the (unknown, future) outcome is a success though. If the IT team tried "anything" and it did not go according to plan it would 100% be seen as their fault irrespective of the circumstances that preceded it.
I'd write down on paper:
"""
To whom it may concern,
I am aware that there has been a power outage, and the power is still out.
I have been told by IT that the documented, agreed, and tested procedure is to fail over the IT systems to the backup site.
I have been told by IT that restarting the IT systems here, on generators, is not the documented or agreed procedure, and has not been tested.
I am ordering IT to try to restart the IT systems here, on generators.
Signed
___________
<Name of senior Manager>
Date: xx/xx/xx Time: xx:xx
"""
Then I would give that piece of paper to the director and ask them to sign it to confirm the order.
Any sane person would take one look at that and refuse to sign it, and let the IT people follow the plan.
If the director is stupid enough to sign, then they get what they deserve.
Many many years ago I was asked to create some tests for a file storage system. The intention was that a file could be moved between two storage areas, but you only ever saw it in one. The whole file could be accessed from 'A', or 'B', but you should never see it in both at once, nor should you ever be able to see a partial file anywhere. I specified: use a large file so you have a few seconds to act in, start the transfer, disconnect the network cable, wait a few seconds, check. No problem with that one. The second variation said start the transfer, disconnect the power lead from one side, then the network cable, repower but do not reconnect the network, wait for startup to complete and check. The project flat out refused to run this test. Considering this was a system intended for usage in combat areas (even though in staging posts rather than front-line), I did not consider it an unreasonable scenario. However, the manual stated that systems must always be shut down cleanly by following the specified procedures.
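For what it's worth, the invariant being tested there (never partial, never visible in two places) is usually approximated with a stage-then-rename pattern. A hedged local-filesystem sketch of the shape (the real system moved files between networked storage areas, which this does not attempt to reproduce):

```python
import os
import shutil

def move_atomic(src, dst):
    """Move a file so readers never see it partial.

    Copy into a hidden temp name next to the destination, fsync it,
    then rename into place (atomic on POSIX within one filesystem)
    and finally remove the source. A power cut mid-copy leaves only
    the temp file, which a sweep can discard on restart.
    """
    tmp = os.path.join(os.path.dirname(dst) or ".",
                       ".incoming." + os.path.basename(dst))
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()
        os.fsync(fout.fileno())   # data on disk before it becomes visible
    os.replace(tmp, dst)          # atomic: dst appears whole or not at all
    os.remove(src)                # source disappears only after dst exists
```

Strictly, there's a tiny window between the rename and the unlink where both copies exist, and a power cut in that window leaves two - which is exactly why pull-the-cable tests like the one described above matter.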
…OK, so pressing The Big Red Button is not something to be taken lightly, or to be done without consideration of the consequences. But given that this is supposed to have been a resilient process, if Bob doesn't like the idea and can't come up with a better reason than "the manual says not to do it, and there's nothing in our procedures about it…" then someone ought to be giving him Hard Stares and asking him Awkward Questions, and other people ought to be answering Awkward Questions about why Awkward Questions weren't asked before Bob's employer got the deal.
Whilst a 2nd-year BTEC Electrical Engineering student back in 89/90, I "accidentally" cut power to the entire college campus! Picture the scene: about half a dozen spotty 18-year-olds dispatched to the college main electrical switch room to make sketches of, and label, what we found in there. Any road up, in we all went to this quite small room, armed with A4 binders, paper and pencils. Had a bit of a butchers around - OK, that's obviously the main breaker for the campus, chuffing great BIG switch like a one-armed bandit... hmmmmmm, what's this small flippy thing on the side of it, says I. Press, CLUNK as the massive handle descends, then an eerie quiet as EVERYTHING was off! This lasted a few seconds until broken by the sound of my mates saying "oh sh1t" quite loudly, and the sound of doors opening and lecturers coming out into the corridor wondering what the feck was going on! Yes, I had successfully found and identified the main breaker's test switch!
My mates, who were all fab and several of whom are still in touch 30 years later, covered for me and said I'd knocked the test switch with my A4 binder. For a whole term afterwards the college clock and bell were about 5 mins off, as that's how long it took to get the power back on and they didn't reset the clock hahahahahahahahahah
About 15 years ago, a colleague and I went on site to a local PoP to install a third switch into an existing Cisco 3750 stack. The plan was simple - screw the switch in place, then just remove one of the two stacking cables between the existing ones, connect the third switch in-line, add a third cable, flip the breaker and configure the new ports. Simple, right?
Only my colleague had a brief brain fart and flipped the breakers for the OTHER two switches, essentially shutting down access to the whole PoP - taking some thirty-odd thousand customers connected downstream with it. And to add insult to injury, when the stack booted back up, the NEW switch became the stack master, so all the port numbers were transposed and all the old ports had to be reconfigured. Via console. As both our phones were pinging like crazy trying to tell us that we'd taken down the whole site. Trust me, we knew. And that was the day we learned to always pre-configure the stack member serial numbers.
(We still work together. And our colleagues still to this day remind us of that incident whenever we walk towards a Data Center together.)
Needed to get something off a shelf above the UPS; this UPS had an emergency off button that was about 1mm proud of the panel it was mounted to.
Caught it with my knee and dropped the main comms room and a large AS/400. It took an hour for the AS/400 to come back up.
I left the company 21 years later so it didn’t do any long term harm to my career.
While less senior than now, in a moment of madness I decided to see if I could rebuild the entire DB and data from scripts.
It took me a few days to get it working, and I was then able to recreate the company's main system from scratch, feeling pretty good about it.
I went home and thought nothing of it, until we came in the next day to fire brigade lights flashing and the building shut: a water pipe had burst, and in the main server room a tsunami was occurring.
The bosses were panicking about the business not surviving, so I mentioned I could rebuild the system in a matter of hours IF they got my desktop out.
Later that day the system was back up and running slowly on my PC.
Hero back slapping ensued then it struck me the data I had been working with was at least a week out of date, best not to ruin their day so I kept quiet.
One of my Finest hours
Sounds like he managed to come up with a scenario not covered in their manuals, which it absolutely should be: the "trigger happy" user who just goes around pushing buttons. If it happened once, it's bound to have happened before and will happen again. IBM should give the guy a bonus or something for finding that particular gap in their process.
A number of years ago, on a new hospital build with lots of IT, it was time for the power-down test on the data centre. A proper power test - severing of supplies etc. to test data centre failover.
Everyone in a room. For some reason Data Storage guys were adamant they needed a soft shutdown because anything else would knack the arrays.
Idiot here popped up and pointed out that if they were going to be damaged they weren't a lot of use in a major power loss scenario.
All hell breaks loose with lots of angry questions from the customer to the Storage specialists.
At that point I made my excuses and left before I got lynched.
Sounds a perfectly reasonable request to me.
If the system is that sensitive then it needs its own way of reacting to a power-down: either a local UPS or a battery system to write the final cache entries to the disks before it dies. It should be part of the chassis or firmware that handles this.
Otherwise, there are things outside its control that it must be able to deal with.
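Real arrays do this in firmware with battery-backed cache, but the shape of the idea fits in a few lines: on the shutdown signal a UPS daemon typically sends, spend the grace period flushing the write-back cache to stable storage. Everything here (the cache layout, the directory path) is invented for illustration:

```python
import os
import signal

# Hypothetical in-memory write-back cache: name -> bytes not yet on disk.
dirty_cache = {}

def flush_cache(directory):
    """Write every dirty entry out and fsync, so nothing is lost at power-off."""
    for name, data in dirty_cache.items():
        path = os.path.join(directory, name)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force it to the platters, not just the OS cache
    dirty_cache.clear()

def on_power_event(signum, frame):
    # A UPS daemon signalling low battery (commonly via SIGTERM) gives a
    # short grace period: spend it making the on-disk state consistent.
    flush_cache("/var/lib/example-array")  # hypothetical data directory
    raise SystemExit(0)

signal.signal(signal.SIGTERM, on_power_event)
```

The point of the pattern is that the power-loss reaction lives inside the system itself, rather than relying on an operator following a soft-shutdown runbook.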
A few years ago, I was part of a small team working on an inventory management system. Because we couldn't find a commercial one that fitted our needs 100%, our manager initiated a project to design and build one.
We did, and the final system worked well for years. We had the development and test servers set up properly, with us doing development on the development server and testing it as rigorously as possible with the resources available to us. One of my colleagues also had the responsibility of rolling out updates to the production server. In theory, he was the only one with the rights to do this - a deliberate choice to prevent accidental updates. In fact, the system was slightly bureaucratic, again deliberately, to prevent accidental updates.
The system had been in use for some months, when all of a sudden, the users started reporting errors. My colleague and I started investigating. The third member of our team apparently had a meeting, so vanished.
After about 15 minutes investigation, we had worked out what happened. The system had its own database on SQL server. It recorded everything the system did as a "transaction" and the transaction table had vanished.
Upon further investigation, we found the transaction table had been renamed to a full stop, and that the technician who had vanished for a meeting was the one who had renamed it. Unfortunately, our version of SQL Server Management Studio wouldn't allow us to access the table after this, so we had to restore it from the previous night's backup. Thankfully, we only lost a couple of hours' transactions.
A major oil firm realised they had a power outage at "DC1" but UPS kicks in perfectly
Process initiated to bring down systems carefully, as "DC2" is absolutely fine
Systems/workloads closed satisfactorily, well in advance of the UPS dying
...
Corporate meltdown then ensues as they'd shut down the systems in an orderly fashion in DC2
Bill Gates (for it is he) was so convinced that an install of the latest Windows was bulletproof that, while it was being demonstrated by his VP on stage and on live video across the planet, he pulled the plug on the test PC in the middle of the process.
To say that the VP was "white faced" at that point is an understatement.