Oh, I think all of us have at some time entered a command in the wrong terminal. But rarely does it have such a dramatic and difficult-to-recover result.
At least Clint's other office was not 2000 miles away...
I was petrified one time I was updating a remote server... in New York, on a Saturday.
Now that may not seem so bad, but I was in the UK, and I had to get back home that night for my wedding anniversary, or my life wouldn't be worth returning to!
Also, I wouldn't have been able to get into the remote site even if I had flown over.
After hitting enter on the reboot command there was an agonisingly long wait, but the system finally appeared back on the network. I was worried for a while, though!
At one gig about a decade ago, the large multinational drugmaker I was consulting for had absorbed, via acquisition, a bunch of remote sites that had their own SANs of various sizes. These had previously been handled by local people (usually vendor service types, since the local datacenter staff feared touching them). The company wanted all of that centrally managed along with the rest of its SANs, but few of the sites were current on patch levels, they either followed no naming or zoning standards or didn't follow The One True Standard of this company, and so forth.
My task was to handle that. Believe me, when you are making changes on a SAN halfway around the world, where the only recourse if something goes wrong is to contact the local on-call person (and hope they actually respond) and have them contact the local Cisco/Brocade/EMC tech service rep, you have to be exceedingly careful. Not all the SANs were known to have working redundancy, so I would only work on one half of a SAN at a time and one half of an array at a time. There was a maintenance window, but it wasn't intended to become a maintenance weekend! The devices also weren't necessarily named well: the DNS name might not match the name the device identified itself with at the prompt, most were accessible only via an IP after connecting through a VPN or terminal server, and so on. So I was double- and triple-checking to ensure I was on the right switch or controller, at the right location, that it had the hardware version that was claimed for it, that I was applying the proper software for that hardware: basically everything I had to be certain about.
I actually caught a couple of cases where the change order specified something different from what I found, and I decided to cancel rather than proceed. That irritated the project manager handling the project, but I was later praised by their VP of storage and backup (that's how you know a company takes it seriously, when that position is VP level...) because it turned out that, had I proceeded in one of those cases, the switch would likely have been bricked: we didn't know it at the time, but the other one had had its power supply fail the day after a "health check" of the environment was done prior to approval of the change order.
It is a lot easier making changes to something that's down the hall or across the street than across the Pacific Ocean!
Oh, I should add, for those wondering why I was tasked with that instead of one of their storage engineers, who would have been equally capable of such a project: I wondered that too, but down the road I figured out the reason. There was a strong culture of "finger pointing" in that team. You were either on the good side of that VP or you were on the shit list, and all the permanent people believed this project was a ticking time bomb that would inevitably blow up; they were more than happy to hand it off to a consultant who was not going to be around to live with the shame of failure.
Of course, that means they didn't get the credit either, but their top few guys had been there for something like a decade, so they were probably pretty comfortable with their salaries and stock options and didn't want to take any career risks they didn't have to. Perhaps they also didn't want to work weekends, whereas I didn't mind spending a couple of hours on a random Saturday morning or afternoon (depending on the time zone). It was winter where I lived for most of the project, so it wasn't as if I was missing a tee time or anything.
I spend a whole lot of time managing network equipment remotely, and I've had my fair share of "well, better grab my car keys" moments.
Recently I was managing a network containing five Netgear switches. The customer didn't want to pony up for stackable switches, so I ended up having to go through each switch manually to make some changes to the trunk ports.
They had about 15 non-consecutive VLANs, so I essentially had to enter a comma-separated list of them, twice: once as the list of "allowed" VLANs, and once as the list of tagged VLANs. Our management network was untagged over the trunk ports. Guess who forgot to remove VLAN ID 1 from the list of tagged VLANs on one switch?
I'm not completely sure why this messed up the network as badly as it did (the other VLANs should've kept working without problems), but a short car drive was required either way.
Usually when I lock myself out it's a bit more boring: just small errors when configuring WAN settings.
Instead of a second pair of Cisco boxen, Clint* might have been wiser to go for another vendor with an obviously different CLI. Juniper or... Huawei ;). I recall the latter's new enterprise switches around the turn of the century were a fraction of the price of the US vendors'. Presumably why so much fun has been had pulling them out later.
On hosts I *always* include the hostname (via $(hostname --short)) in the $prompt/$PS1, especially for superuser accounts. Some who ought to know better, when replacing a running machine, have the new machine with the same hostname *and* on the network, just with different IPs. Plenty of scope there for "les culottes chocolat" (brown breeches).
* The cinematic references abound: "Do you feel lucky, punk?", "A Fistful of Dollars", "For a Few Dollars More" and, not forgetting, "The Good, the Bad and the Ugly", each of which could apply quite generally to any of the large IT and network kit vendors, but perhaps without the "good."
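For anyone who fancies borrowing that hostname-in-the-prompt habit, here is a minimal sketch, assuming bash (these are standard bash prompt escapes, not the commenter's actual setup):

# Sketch only: put user and short hostname in every prompt, so
# "which box am I on?" is answered before each command is typed.
# Goes in ~/.bashrc - and, crucially, in root's ~/.bashrc as well.
PS1='\u@$(hostname --short):\w\$ '
# bash also has a built-in escape for the short hostname:
# PS1='\u@\h:\w\$ '

The \$ escape conveniently renders as # for root and $ for everyone else, which is half the battle on superuser accounts.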
The problem with the "diversity" approach is that if you want a redundant system, you usually want it to be identical to the master system in every respect, so you have a high degree of confidence that it will work once failed over. If there are significant differences, you'd need to re-test the failover every time there was a software or configuration change to any part of the system.
Unless this is something I've misremembered or just hogwash (and may no longer be the case)...
NASA has redundancy from different companies, so they can't both fail for the same reason.
The European Space Agency has redundancies from the same supplier, so when there's an inherent problem in one, both fail.
A cross-street network going down is nowhere near as bad as a code error borking a space mission though, so it'll probably be fine.
The Cisco 6509 was a chassis switch. Yes, it was L3, and you could also install a firewall module, redundant supervisors, etc.
A proper beast, with its woeful blocking backplane as its Achilles heel.
https://www.cisco.com/c/en/us/support/switches/catalyst-6509-e-switch/model.html
Firewall module:
https://www.networkstraining.com/cisco-firewall-service-module-fwsm/
The Cisco 7600s were the routers.
Along with the old-school HP laser printers, the 6500 series would still be operational post-apocalypse.
Yes, the supervisor did do routing, but that just made the chassis an L3 switch, as mentioned.
Cisco refers to them as switches:
https://www.cisco.com/c/en/us/support/switches/catalyst-6500-series-switches/series.html
Adding an FWSM made it a firewall, but with limitations versus an ASA:
https://community.cisco.com/t5/network-security/fwsm-vs-pix-vs-asa/td-p/734843
The main difference between the 6500 and the 5500+RSM (apart from being different generations) was that the 6500 had an integrated RSP and was running IOS natively for BOTH the routing and switching portions. The 5500 (switch portion) ran CatOS, and you would log on to the RSM which ran IOS separately to configure the routing functionality. (Source: Have locked myself out of both models.)
Yes, you could put various L3 Sups in the 6509. Depending on your needs, and just how fat your wallet was, you could go anywhere from basic L2 to full L3 BGP Internet routing. And since the TCAM was small and fixed in size, as the Internet routing table kept growing astronomically your wallet had to grow with it, swapping in the Sup engine of the day just to keep up.
FWIW: the 7600 designation was the same exact chassis/cards, but marketed by a different BU at Cisco.
If you were an ISP and had a fat enough wallet, you'd get the 7600. If you started out as Enterprise, you'd get the 6509. Same features, options and Sups available; just a different badge on the front, and a different sales team on the backend talking to you at Cisco.
Cisco would laugh all the way to the bank either way.
When the stuff you are managing is remote, you don't get the luxury of nipping across the road to fix it.
Before doing anything, save the config.
Next, do a "reload in 5".
When the inevitable happens, you just wait for the box to reload into its saved config.
Juniper has its "commit confirmed" equivalent, which just reverts the config instead of reloading the box.
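As a sketch of that safety net in IOS terms (the five-minute timer is just an example; check your platform before leaning on it):

! Save the running config so the scheduled reload comes back to a known-good state:
copy running-config startup-config
! Schedule a reboot that only happens if you do NOT cancel it:
reload in 5
! ...make the risky change...
! Still able to log in afterwards? Then call off the reboot:
reload cancel

The Junos equivalent mentioned above is "commit confirmed 5" followed by a plain "commit" once you've confirmed you can still reach the box; if you can't, the configuration rolls back on its own.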
Ah, the days when you had 40 terminal windows stacked around two 17" CRTs... and being an old-school *nix guy it was focus-follows-mouse, not click-to-focus.
The paranoia over accidentally sending the command to the wrong system was legendary. I spent several hours writing scripts and terminal config files so that dev systems were green themed, QA/test orange and production red. Used various kludges and hacks to get them to pick up when I had a root shell open and invert their colors too.
One of my colleagues thought I was being overly silly, and I admitted that I mostly did it because I wanted to find out if I could and whether it made a difference (and whether I could make it look cool into the bargain, compared to the "vanilla Microsoft" look of the desktops belonging to people NOT admitted to the inner circles of systems administration and therefore not provided with a "real workstation" at their desks)...
Until the day that same colleague sent a shutdown command into prod rather than dev and asked me for copies of how I did it.
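In the same spirit, a rough sketch of the colour-by-environment trick, assuming bash and ANSI-capable terminals; the dev-/qa-/prd- hostname prefixes are made up for illustration, so substitute whatever your own naming scheme actually gives you:

# Sketch: colour the prompt by environment, and invert it for root shells.
case "$(hostname --short)" in
  dev-*) colour='\[\e[32m\]' ;;   # green for dev
  qa-*)  colour='\[\e[33m\]' ;;   # yellow/orange for QA and test
  prd-*) colour='\[\e[31m\]' ;;   # red for production
  *)     colour='\[\e[0m\]'  ;;   # default anywhere unrecognised
esac
# Root gets reverse video on top, so a superuser shell is impossible to miss.
[ "$(id -u)" -eq 0 ] && colour="${colour}\[\e[7m\]"
PS1="${colour}\u@\h:\w\\\$\[\e[0m\] "

Not quite the original scripts-and-terminal-config-files treatment, but it covers the dev-green/QA-orange/prod-red idea and the inverted root shell in a dozen lines.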
And the ones who say they haven't are either lying or have never worked on a production system.
It is a standard interview question round here: we all know you will have done something stupid, but how you behave after doing said stupid thing is what is important.
Personally, I think honesty is the best approach: own up before someone finds out it was you and that you tried to hide it.
"It is a standard interview question round here, we all know you will have done something stupid but how you behave after doing said stupid thing is what is important."
And just as important is the reply a candidate gives. Same league as "what's the biggest mistake you ever made?"
If a candidate replies that they have never inadvertently shut down the wrong server / device interface, I'm instantly suspicious. Likewise, if they claim to have never made a mistake, they are either fibbing, or I wonder how they would respond when they do make the inevitable mistake.
"If a candidate replies that they have never inadvertently shut down the wrong server / device interface, I'm instantly suspicious. Likewise, if they claim to have never made a mistake, they are either fibbing, or I wonder how they would respond when they do make the inevitable mistake."
Also, Monte Carlo modelling (OK, more the fallacy) would suggest that if they haven't made a mistake like that yet, they're overdue. But it's a question I often ask candidates because it opens up a few things, like how they dealt with the mistake and maybe their honesty. Plus any lessons learned and actions to avoid repeats. Safety rules are written in spilled blood and bits.
But been there, done that, which is why I'm a big fan of OOB access, terminal servers and as many ways as possible to get back into a box that's been accidentally borked. Especially when said boxen may be multiple timezones away. Memorable highlights have been typing "debug all" on a rather overloaded Cisco running peering and transit for a reasonably sized chunk of the UK. Oops. Or discovering that Livingston PortMasters connected to Suns sent a break to said Suns' console ports when they powered up. That one was the sysadmins' fault for not telling us they did this, and for deciding they could sleep while we netengs did some upgrades that meant we had to power off and move their PortMasters. Or just reconfiguring things like IGPs, spanning tree or, of course, BGP. There are soo many ways a network can bite the hands that feed it.
And in case you are wondering, I have, in my career…
Powered off a comms room because I caught the emergency bypass switch on a UPS (the switch stuck out of the case by 1mm).
Moved a server on wheels and pressed the power button (a very old server where the power dropped as soon as the button was released; that spring was very strong).
Missed an option on a Cisco debug command and overloaded the router.
Trashed the entire VLAN index on every switch at a site with 80+ switches. Told the boss there was a problem, that I had caused it, and that I would fix it; it took 12 hours to fix.
And one of the best: driving home from an overnight job at a remote site, I got a call from a colleague along the lines of "I turned spanning tree off on a port and nothing works in the main data centre." To which the answer was: well, you are going to have to turn off one of the cores, probably both, so the loop disappears, wait 10 minutes, then power it back up, and make sure you disconnect the port you changed. P.S. Go and tell the bosses that you screwed up, first. I got asked "do I need to tell the bosses?", to which the answer was simply yes: if they don't know about the issue now, they will when all the servers go down due to the loop in the network, and it is better they find out sooner than have to come searching for answers.
In my experience (strokes long grey beard) much the most common cause of failure in resilient systems is a single failure (which causes no operational problems) followed by an attempt to repair the working, not failed, device. It never happened to me (honest), but I've heard some tales ...
I worked for a medium sized company (Medco Ltd).
We were taken over by a very big company (Jumboco)
Jumboco's HQ and labs were in a different country.
Jumboco insisted that our nightly builds must be done on their mighty compilers in their labs. They could then be released to our grateful customers as required.
We only had one fat pipe to the outside world, this was obviously a single point of failure and could not be countenanced by a big important company.
Pro tip:- If you are going to dig a long trench alongside the existing cable for the new cable ensure that a) this work is performed on a Friday afternoon and b) a very important software upgrade is being released to one of your most important customers on Saturday.
Silly rhyme I learned many years ago from my grandfather, and later saw in a book of puzzles for children (I think I've remembered it correctly, I mean, we're talking 1970s here, my grandfather probably learned it when he were a lad in the 1910s).
If your grate B M T, put:
If your grate B. putting:
(read out the punctuation)
"Pro tip:- If you are going to dig a long trench alongside the existing cable for the new cable ensure that a) this work is performed on a Friday afternoon and b) a very important software upgrade is being released to one of your most important customers on Saturday."
That's where you insist on diverse routing, where the second link shares no common path with the first: coming into a different corner of the building, going to a different telco exchange, and preferably via a different service provider (i.e. BT for the first link, Vodafone for the second).
You wouldn't believe how many geographically diverse circuits travel the same path end to end - some of those install ticks, I mean techs, think that it's good enough if one is on the 101/carrier/pointa/pointb and the other is on the 102/carrier/pointa/pointb. Then guess who gets yelled at when they both drop due to a fiber cut? At least I've gotten good enough at handling irate customers that the next time there's an issue they remember who I am and suddenly they're no longer mad.
I can remember (more senior) colleagues discussing this in the 1980s. Presentation to us might be separate copper pairs going in different directions to different exchanges at each end of the circuit. There was no guarantee that they wouldn't be multiplexed together on some PDH/SDH link somewhere in the middle.
These were mostly "music" circuits rented from BT, plus some of that new-fangled data that would supposedly replace our venerable 75 baud telegraph.
"There was no guarantee that they wouldn't be multiplexed together on some PDH/SDH link somewhere in the middle."
I made a point of putting in hefty penalty clauses for if this is found to be the case.
It makes them double-check.
The funny part is when they backpedal and refuse to allow such clauses: if it never happens, why are they afraid of it being called out in the contract?
Yabbut, the site was surrounded by private land; the only way out was the drive. Unless you wanted to start negotiating for wayleaves with neighbouring farmers.
The real problem was that they took the compiling hardware plus the associated authentication/digital signature box. Without the sig/hash/whatever, the code wouldn't run on our proprietary hardware.
Had they left things alone we could have couriered the signed code to another site or Internet access point.
But they had to make things better.
Eejits.
Yes, classic that. Happened to a former employer... they needed/wanted power redundancy. It turns out that while the power and data cables were routed differently (so some errant digger could've dug them up and we'd have been OK), the power did end up being served by the same substation. Substation goes pop, pop goes the power on both ingresses (and no power means no data). Oops.
Thankfully, current employer does have true redundancy (for power and data) on the data centres, although I am still not particularly happy having to park some critical path stuff behind firewalls (not my call) rather than switch ACLs (for our particular use case we only need a handful of ports exposed to $planet, but they must *never* go down). Redundancy is fab, until the big great firewall that was put across both ingresses has a bad moment, and you're not feeling the redundancy (yay, servers are up, but no traffic whatsoever flows, which is pointless).
*shrug* Not my fault, just my head on the platter for the customers. So that's effectively in our risk register as something someone else needs to carry the can for.
Using a second provider seems like a good idea, but I've seen cases where it turned out that somewhere along the way, both providers were using the same trench someone just cut through with a digger. Many providers will offer the option of diverse connections where they ensure that no part of the connection runs in the same physical path or on the same part of the backbone. At a price, obvs.
Reminds me of my time many decades ago working for a major broadcaster in the UK. The Videotape department was the hub of operations, and very little TV was actually "live" - even back then. The quadruplex VTRs required an air pressure feed for the air bearings in the video heads - these things were doing around 15,000 rpm and were servo controlled to incredible tolerances, even by today's standards.
To service the 40-odd VTRs in the basement there were three compressors feeding the air lines. The whole area could run on two, and over half could run on just one. Triple redundancy, you might think. So when I turned up for my shift one morning and found the whole area at a standstill, my initial thought was that a strike had been called - this was the early 70s, after all!
But, no! The compressed air was down! How, you might ask? Well, the air compressors were water cooled, and the council had been digging up the road outside and fractured the main water supply into the building! No water, no compressed air, no VTRs!
As supplied by the manufacturers, the VTRs came with their own small, but very noisy, compressors. These had been removed once the central air supply had been fitted. Fortunately, a few remained stowed away in a cupboard, somewhere, and these were frantically re-installed on three or four machines so that transmission at least could be restored.
Aside from the thrashing of these little compressors, the department was very quiet that day.....!
Lesson learned: Triple redundancy doesn't help when you have a single point of failure!
Many years back I worked for M&S and got transferred to a store in a new out of town retail park, which featured a Sainsbury's at the opposite end of the park and the obligatory Sainsbury's petrol station.
M&S were (at the time) known for having a procedure for *everything* that could go wrong, and a couple of months after opening it got put to the test. There was a huge amount of construction work still going on around us, and some genius managed to sever the buried 11kV line to the retail park substation.
In M&S, the "Oh f**k we've lost power" automated sequence kicked in. Pelmet lights around the store went out, HVAC went into "Ultra-Eco mode", and crucially, the backup generator kicked in nicely, whilst also pinging Head Office to request a replacement diesel delivery ASAP.
20 minutes later, a poor sod from Sainsbury's turned up with a 10 litre jerry can and asked if they could "borrow" some diesel for their generator as it was running out.
"But you've got your own petrol station!?!"
"Yeah, they didn't connect it to the generator."
"Ah. Sucks to be you..."
> a poor sod from Sainsbury's turned up with a 10 litre jerry can and asked if they could "borrow" some diesel for their generator as it was running out.
The company I worked for in the late '70s/early '80s used to supply the generators for Sainsbury's; they had an 8-hour tank (if they bothered to keep it filled up).
Also worth checking: that the backup generator's starter motor battery is being correctly float charged, and that the batteries actually work.
(Happened at one site - the person in charge brushed off my questions on that specific point two years prior to the outage which revealed the issue.)
Not sure I believe this one. If he was logged into the router in the building he was in, why did he have to go across the street to console into the router there to restore the connection? Even if they were in a true failover configuration with a single namespace for each pair, it would be obvious which side he was on and which interface was which. If he saw an interface that was in a fault state, why would he issue a command that would take down an interface that was NOT faulted? (I.e., if Serial0/1/0 is down, why would he issue a command to shutdown Serial0/0/0?)
In the early 2000s, I worked for a large ISP on the US east coast. We worked out of an office in Massachusetts. One of the engineers turned off an interface on a router in Virginia which brought down half of our backbone, and for whatever reason there was no management console remote access. Luckily we had a partner or something in the area that was able to drive there and just rebooted the whole router but it was still down for a couple of hours.
There was also a time when something like that happened, and an engineer had to get on the next flight from MA to another state to reboot a router.
In another instance, we started announcing routes over BGP for basically the entire IPv4 address space, which our peers accepted, so traffic for the entire Internet began coming to us and dying, causing outages for other providers' clients across the entire eastern half of the US. Outages that take down large numbers of websites and services are commonplace these days when cloud providers go down, but that didn't happen often back then. We took it as a badge of honor that we were able to cause that much of an interruption, and even made what would now be called a meme that we printed and posted in the office. (Something like "Can your ISP bring down the entire Internet? We can.")
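For anyone wondering what normally keeps that sort of leak in check: an outbound prefix filter on the announcing side and a maximum-prefix limit on the receiving side. An IOS-flavoured sketch, with documentation addresses and AS numbers standing in for the real ones:

! Sketch only: announce nothing except our own aggregate, whatever
! finds its way into the BGP table internally.
ip prefix-list OUR-ROUTES seq 5 permit 203.0.113.0/24
!
router bgp 64500
 neighbor 192.0.2.1 remote-as 64501
 neighbor 192.0.2.1 prefix-list OUR-ROUTES out
!
! The peer would typically protect themselves as well, e.g. with
! "neighbor <our address> maximum-prefix 100", so a leaked full table
! tears the session down rather than being accepted wholesale.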
Yeah, that's what this story category is for. https://www.theregister.com/Tag/who-me
"Who, Me? is a weekly column in which our readers confess to catastrophes they created in the pursuit of IT excellence - and usually managed to get away with.
"The column is a light-hearted look at the world of work and tech."
So, "usually managed to get away with".
I was once called out to be "a pair of hands" in a data centre, to replace an HDD in a server for a major online payment processor. It was "absolutely vital" that the job be attended within a specified, fairly short time frame, as the RAID array of five disks "only" had two hot spares. In other words, not really all that time critical! So I'm on site, chatting with their server guy on the other end of the phone, having already identified the failed HDD from the blinkenlights, while he talks me through the process and says he's going to flash the LEDs. Fair enough, he doesn't know me or my skill level. Job done. So I mention that there are two PSUs for redundancy, and only one has blinkenlights on. A short moment of silence, followed by obvious clicky-clacky keyboard sounds, and "oh shit, the monitoring alert hasn't been set up for that!" So the org was in a "panic" over a failed HDD with something in the order of three levels of redundancy (RAID inc. parity, plus two hot spares), while the real, more urgent issue was the two redundant PSUs, one of which, according to the logs, he told me had been failed for three months.
Sometimes it's not just a case of monitoring stuff; it's also about checking that the stuff that needs to be monitored and/or alerted on is actually set up to monitor and alert :-)
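In that spirit, a rough shell sketch of the sort of check that would have caught the long-dead PSU, assuming a server with a BMC reachable via ipmitool; sensor names and status strings vary by vendor, so the matching below is illustrative rather than authoritative:

#!/bin/bash
# Sketch: flag any power-supply sensor that does not report "ok".
# Assumes ipmitool talking to the local BMC; wire the non-zero exit
# into whatever monitoring system is (supposedly) watching the box.
BAD=$(ipmitool sdr type "Power Supply" | awk -F'|' '$3 !~ /ok/')
if [ -n "$BAD" ]; then
    echo "WARNING: power supply sensor(s) not reporting ok:"
    echo "$BAD"
    exit 1
fi
echo "OK: all power supply sensors report ok"
exit 0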
The year must have been just around 2000. I was still learning, but one task was customer support. We hosted several customers' "on-prem" servers. One of these was an Exchange server. At the end of the day I had to do some work on it, so I remoted into it with Remote Desktop. I did most of the work in full screen: low budgets meant crappy computers with crappy monitors, meaning not a lot of space to work in if the session was in a window. Anyway, I finished my work, talked with a colleague and shut down my computer, but a few seconds later I saw my real desktop. My first thought was "that was odd", but then I realised what I had done. I also realised the pop-up I had just clicked OK to when I shut down the server was Windows telling me that I was on a remote server and asking if I was really sure I wanted to shut it down.
I called the customer right away and told them of my blunder, then ran down to the server room in the basement, where the server was still shutting down…
From that day I learned to take extra care and attention when doing remote work. Always double-check before doing potentially disastrous actions.
"From that day I leaned to take extra care and attention when doing remote work. Always double check before doing potentially disastrous actions."
MS can shoulder some of the blame for making everyone "click happy": so many unnecessary "Are you sure?" pop-ups have effectively trained people to auto-click without reading.