Real life stories like this are always worth a nose
I took the plunge and became a freelance IT consultant in 2001. Through an unlikely series of coincidences (former colleague from London goes to travel show in France and bumps into two guys from Yorkshire who are looking for a software and database architect) I ended up in North Yorkshire …
That was actually filmed only a couple of miles down the road from where this story was based.
I've got to say though that Twistleton Scar is distinctly more wild than the rolling drumlins around the location in question.
Still, nice to have a story from my backyard!
Yup, Cisco tried to sell this to my customer. 'Anywhere in the UK, sir'.
The customer was sceptical: 'And how does that work in a gale when Guernsey airport is closed and the ferry can't dock for three days due to high wind and seas?'
Instead, they carried a complete set of spares themselves.
There we were, working away in Central Asia, when one of the two servers died. We found that it was a CPU board failure. The nearest replacement was in Moscow (3 time zones away). After several long phone calls we dispatched one of the client team to the airport. He caught a flight to Moscow, where he was met by the Field Service Manager. A CPU board swap took place in Sheremetyevo Airport and the return flight was duly caught. The local went because he didn't need a visa to enter Russia. Us westerners would have needed one. The airfare at the time for locals was also a quarter of that for us rich westerners.
A little under 10 hours after the crash the system was up and running again.
The IT director took us out to dinner for fixing the system in the way we did.
This was in the mid-1990s. Those were the days.
An employer in the 1980s had scheduled an upgrade to a VAX. The engineer was supposed to be on site for a few hours during which the system would be unavailable -- timed to cause minimum impact to customers in Europe and Japan.
Everything was going well until the engineer stood on the last board to be fitted. Ouch. DEC found a replacement in Manchester and it arrived three hours after the accident. That's a pretty good time for handling a distress call after normal working hours, finding the board in a warehouse and driving it to the Midlands. The VAX was back in service later than scheduled but no customer complained.
It was an early learning experience:
* Competent plans go wrong in unexpected ways;
* Great suppliers are the ones who do a good job when things go wrong, not organisations that (unrealistically) never make mistakes;
* Set honest expectations -- with staff and customers -- and be frank about errors and problems.
Not a lot of room between the back wall of the server room and the server rack. You are right though -- don't put kit on the floor unless there is nowhere else to put it. It's human to trip over.
My argument was about how the organisation providing a service responded to a foul-up.
DEC found a replacement in Manchester and it arrived three hours after the accident. That's a pretty good time for handling a distress call after normal working hours, finding the board in a warehouse and driving it to the Midlands.
DEC were good at that. Sadly, it's probably the cost of that sort of service that killed them.
Our office in Belfast was bombed late one Friday afternoon (we had the misfortune to share the building with a tax office). No-one was hurt, everyone evacuated in time, and the servers all came back up OK on Saturday once we had the all-clear, but the offices were uninhabitable (smashed windows, ceilings down) and most of the terminals on people's desks were wrecked.
Our boss put the DR plan into effect on Friday evening, and phoned DEC. Saturday lunchtime, while our own guys were cabling up spare space in the building next door, DEC arrived with a vanload of new terminals and other kit, driven up from Dublin. Local DEC guys helped get them set up, and by 9am Monday morning everyone had a desk and working terminal.
I'm not sure it would happen that smoothly these days...
First off, it's almost always faster to listen to your user base and their ability to detect service failures than to rely on autonomic monitoring systems. Even remote telemetry solutions from the likes of Sun/Oracle are generally slower than the user picking up the phone to cut a ticket. Telemetry still helps with failures that don't cause an outage, the kind you might not notice until you scrubbed a log, but when the server completely craps out you will likely know before your monitoring tools do.
Remote locations often consider an onsite parts agreement so that critical components are in the DC already for the engineer or the customer themselves to use to restore service without waiting for delivery of parts that could be delayed due to weather, traffic, or because the one part you need is out of stock at your local stocking location.
Most of the time it's probably cheaper (long term TCO) to have N+1 redundancy than to rely solely on a Premium support SLA to keep you in business. Depending on the costs of an outage you might be able to get by with business hours support on gear that you can afford to lose availability on for a few hours. Clustering, load-balancing, now "serverless" application designs or VM/container mobility strategies can buy you time to diagnose and restore individual nodes without having to make the panic call to the vendor at 0-dark-thirty. Of course back in the day of this story there were fewer options on that front and the redundant gear tended to be a little pricey to be left idle.
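To put a rough number on what N+1 buys you, here is a minimal Python sketch. It assumes every node has the same availability and that nodes fail independently, which is optimistic (shared power, shared switches and shared bugs all break that assumption). The function name `cluster_availability` and the 99% figure are made up for illustration, not taken from any vendor data.

```python
from math import comb


def cluster_availability(node_availability: float, nodes: int, needed: int) -> float:
    """Probability that at least `needed` of `nodes` are up.

    Assumes identical nodes that fail independently (a simplification).
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if not 1 <= needed <= nodes:
        raise ValueError("need 1 <= needed <= nodes")
    a = node_availability
    # Sum the binomial probabilities for every count of healthy nodes
    # that still meets the minimum required to carry the service.
    return sum(
        comb(nodes, up) * a**up * (1 - a) ** (nodes - up)
        for up in range(needed, nodes + 1)
    )


if __name__ == "__main__":
    # Hypothetical figure: each server is up 99% of the time.
    single = cluster_availability(0.99, nodes=1, needed=1)
    n_plus_one = cluster_availability(0.99, nodes=2, needed=1)
    print(f"single server: {single:.4%}")
    print(f"N+1 (1 needed, 2 fitted): {n_plus_one:.4%}")
```

Whether the spare box is cheaper than the premium SLA still depends on what an hour of outage costs you, which this sketch does not try to model.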
Cool Story Bro
IIRC, this was an issue for the different UNIX flavours of the period: they could swap a failed CPU out as long as that wasn't the monarch CPU running some of the kernel strings. TBH, it was a great way to scare manglement and get budget for a second system and clustering software: point out that in a 4-way server a CPU failure was 25% likely to hit the monarch, meaning a crash and a total loss of service. "25%" sounded scary; I just used to omit the small likelihood of a CPU failure from the maths.
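For anyone who wants the bit of maths that got left out, a small Python sketch. It assumes a failure is equally likely to land on any CPU, so the monarch takes 1 in 4 of them, and the 2% chance of any CPU failing in a given period is a purely hypothetical figure. The name `monarch_outage_probability` is mine.

```python
def monarch_outage_probability(p_any_cpu_failure: float, cpu_count: int = 4) -> float:
    """Chance of a total outage caused by the monarch CPU failing.

    Assumes a CPU failure is equally likely to hit any of the CPUs,
    so the monarch is the victim with probability 1 / cpu_count.
    """
    if not 0.0 <= p_any_cpu_failure <= 1.0:
        raise ValueError("p_any_cpu_failure must be between 0 and 1")
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    p_monarch_given_failure = 1.0 / cpu_count
    # P(outage) = P(some CPU fails) * P(it is the monarch | a CPU failed)
    return p_any_cpu_failure * p_monarch_given_failure


if __name__ == "__main__":
    # Hypothetical: 2% chance of any CPU failure in the period considered.
    p = monarch_outage_probability(0.02, cpu_count=4)
    print(f"chance of a monarch-induced outage: {p:.3%}")
```

The scary 25% is the conditional figure; the number that matters for the budget is the product, which is far smaller.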
As for no "SSDs" - ahem - yes, there were solid state devices available. In 2001 I was using Texas Memory Systems' RamSan solid state boxes to boost Oracle databases.
Mmmm, unless the system somehow rolled random numbers during boot up, the odds are that the very same CPU was boss after every boot up simply because of the physical layout of the system. That would mean that one CPU would likely see greater wear and tear so-to-speak than all the others. So those 25% odds were probably weighted toward the house more than you might expect.
Well, NAND flash SSD, maybe.
Basically, core memory is SSD too. And in the 1990s several manufacturers had a couple of solid state drives in their program. DEC had one, physically the size of an HSC50 (can't recall the model number; ESE50?), which was essentially a backplane filled with 150MB worth of DRAM boards and an SDI interface, plus an MVAX board with an RD54 hooked up and a UPS. If the power went out, the UPS was to keep the lot running while the memory contents were transferred to disk. Later they had a drive with a 3.5" form factor, SCSI interface, static RAM and a rechargeable battery. Couple hundred MB, IIRC. No idea of the list price of either, but definitely well over that of their size in spinning rust.
I have only worked for one business where the owner was willing to pay for a service contract. For the rest, we made up a song, "The Electron' Swap" to cover how "service" was done:
"I entered the office late one night,
The hardware systems were a ghastly sight,
Our two 'hardware specialists' had their screwdrivers out,
there were pieces of gear all strewn about.
They did the swap, the electron' swap..."
The nearest Frys was over an hour away.