I made this network so resilient nothing could possibly go wro...

Greetings and salutations, dear reader, welcome to yet another fun-filled Monday at The Register. As you well know, each Monday (which is today) The Register (which you're reading) brings you an installment of Who, Me? – our reader-contributed tales of tech gone wrong. This is that very column. To tell you again would surely be …

  1. Paul Crawford Silver badge

    Oh I think all of us at some time have entered a command in the wrong terminal. But rarely does it have such a dramatic and difficult-to-recover-from result.

    At least Clint's other office was not 2000 miles away...

    1. Korev Silver badge
      Coat

      Luckily for Clint this wasn't terminal for his career...

      1. Anonymous Coward
        Anonymous Coward

        Even so, I bet his trousers were a bit shiTTY...

    2. ICL1900-G3 Silver badge

      Yes, had to drive up to the City from Cornwall after one foolishly typed command on the wrong server!

      1. Peter Gathercole Silver badge

        Long journey to fix a (thankfully only potential) problem?

        I was petrified one time I was updating a remote server... in New York, on a Saturday.

        Now that may not seem so bad, but I was in the UK, and I had to get back home that night for my wedding anniversary, or my life wouldn't be worth returning to!

        Also, I wouldn't have been able to get into the remote site even if I had flown over.

        After hitting enter on the reboot command, and an agonisingly long wait, the system finally appeared back on the network, but I was worried for a while!

    3. Anonymous Coward
      Anonymous Coward

      I remember this conversation at a former employer when they decided all the network support should be sent to India. What happens if the problem stops the Indian support teams accessing the in-UK network gear? "*crickets*"

      1. Yet Another Anonymous coward Silver badge

        I think New Zealand just solved that problem

    4. DS999 Silver badge

      At one gig I had about a decade ago, the large multinational drugmaker I was consulting for had absorbed a bunch of remote sites via acquisition that had their own SANs of various sizes, which had previously been handled by local people (usually vendor service types, since the local datacenter people feared touching them). They wanted all that centrally managed with the rest of their SANs, but few of them were current on patch levels, they either didn't follow any naming or zoning standards or didn't follow The One True Standard of this company, and so forth.

      My task was to handle that. Believe me, when you are making changes on a SAN halfway around the world, where the only recourse if something goes wrong is to contact the local on-call person (and hope they actually respond) and have them contact the local Cisco/Brocade/EMC tech service rep, you have to be exceedingly careful. Not all the SANs were known to have working redundancy, so I would only work on one half of the SAN at a time and one half of an array at a time. While there was a maintenance window, it wasn't intended to become a maintenance weekend! The devices also weren't necessarily named well, the DNS name might not be the same as the name the box identified itself with at the prompt, most were accessible only via an IP after connecting through a VPN or terminal server, etc. So I was double- and triple-checking to ensure I was on the right switch or controller, at the right location, that it had the hardware version that was claimed for it, that I was applying the proper software for that hardware – basically everything I had to be certain about.

      I actually caught a couple of cases where the change order specified something different from what I found, and I decided to cancel it rather than proceed. That irritated the project manager who was handling this project, but I was later praised by their VP of storage and backup (that's how you know a company takes it seriously, when that position is VP level...). It turns out that had I proceeded in one of those cases the switch would likely have been bricked: we didn't know it at the time, but the other one's power supply had failed the day after a "health check" of the environment was done, prior to approval of the change order.

      It is a lot easier making changes to something that's down the hall or across the street than across the Pacific Ocean!

      1. DS999 Silver badge

        Oh I should add, for those who are wondering why I was tasked with that instead of one of their storage engineers who would be equally capable of such a project? I wondered that too, but down the road figured out the reason - there was a strong culture of "finger pointing" in that team. You were either on the good side of that VP, or you were on the shit list, and all the permanent people believed this project was a ticking time bomb that would inevitably blow up and were more than happy to hand it off to a consultant who was not going to be around to live with the shame of failure.

        'Course that means they didn't get the credit either, but their top few guys had been there like a decade, so they were probably pretty comfy with their salary and stock options and didn't want to take any risks with their careers that they didn't have to. Perhaps they also didn't want to have to work weekends, whereas I didn't mind spending a couple of hours on a random Saturday morning or afternoon (depending on the time zone); it was winter where I lived for most of it, so it wasn't like I was missing a tee time or anything.

  2. Michael H.F. Wilkinson Silver badge
    Coat

    He pulled the trigger too fast

    on that command to toggle the port.

    Must be something in his regomised name.

    I'll get me coat

    1. Korev Silver badge
      Coat

      Re: He pulled the trigger too fast

      I guess you're running to the Eastwood...

      1. Neil Barnes Silver badge
        Coat

        Re: He pulled the trigger too fast

        He doesn't want his coat; he wants his poncho.

        1. Anonymous Custard Silver badge
          Trollface

          Re: He pulled the trigger too fast

          But then he'd have been regomised as "Null" or "None"?

    2. Mentat74
      Coat

      Re: He pulled the trigger too fast

      Guess the "punk" didn't feel too lucky after that !

    3. KarMann Silver badge
      Trollface

      Re: He pulled the trigger too fast

      I don't get it.

      What does Hawkeye have to do with this? And bows and arrows don't even have triggers, anyway!

  3. Korev Silver badge
    Coat

    I hope the network outage didn't cost his employer a packet...

    1. l8gravely
      Devil

      Just a fistful of dollars...

      1. collinsl Silver badge

        Clint's lucky that they let him back for a few dollars more

  4. SVD_NL Silver badge

    Ah, remote network administration...

    I spend a whole lot of time managing network equipment remotely, and I've had my fair share of "well, better grab my car keys" moments.

    Recently I was managing a network containing five Netgear switches. The customer didn't want to pony up for stackable switches, so I ended up having to go through each switch manually to make some changes to the trunk ports.

    They had about 15 non-consecutive VLANs, so I essentially had to enter a comma-separated list of them, twice: once for the list of "allowed" VLANs, and once for the list of tagged VLANs. Our management network was untagged over the trunk ports. Guess who forgot to remove VLAN ID 1 from the list of tagged VLANs on one switch?

    I'm not completely sure why this messed up the network as badly as it did – the other VLANs should've kept working without problems – but a short car drive was required either way.

    Usually when I lock myself out it's a bit more boring – just small errors when configuring WAN settings.

  5. Bebu
    Windows

    redundancy and diversity?

    Instead of a second pair of Cisco boxen, Clint* might have been wiser to go for another vendor with an obviously different CLI. Juniper or... Huawei ;). I recall the latter's new enterprise switches around the turn of the century were a fraction of the price of the US vendors'. Presumably that's why so much fun has been had pulling them out later.

    On hosts I *always* include the hostname (via $(hostname --short)) in the $prompt/$PS1, especially for superuser accounts (a minimal sketch below). Some who ought to know better, when replacing a running machine, have the new machine with the same hostname *and* on the network, but with different IPs. Plenty of scope there for "les culottes chocolate" (brown breeches).

    * The cinematic references abound: "do you feel lucky, punk?", or "A Fistful of Dollars", as well as "For a Few Dollars More", and not forgetting "The Good, the Bad and the Ugly" – each of which could apply quite generally to any of the large IT and network kit vendors, but perhaps without the "good".
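
    A minimal sketch of the prompt habit above, assuming bash (\h gives the same short hostname as $(hostname --short), and the red root prompt is just one way of doing it):

      # ~/.bashrc (bash assumed): \u@\h puts user@short-hostname in every prompt,
      # so you always know which box you're about to break.
      PS1='\u@\h:\w\$ '
      # For superuser shells, make it shout (red background) so root is unmissable:
      [ "$(id -u)" -eq 0 ] && PS1='\[\e[41;97m\]\u@\h\[\e[0m\]:\w# '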

    1. Tim 11

      Re: redundancy and diversity?

      The problem with the 'diversity' approach is that if you want a redundant system, you usually want it to be identical to the master system in every respect, so you have a high degree of confidence that it will work once failed over. If there are significant differences you'd need to re-test the failover every time there was a software or configuration change to any part of the system.

      1. ChrisC Silver badge

        Re: redundancy and diversity?

        OTOH, the identical approach may leave you exposed to common-mode flaws that could take down your entire system at the same time...

        1. Yet Another Anonymous coward Silver badge

          Re: redundancy and diversity?

          Yes but that would be Cisco's fault so you can't be blamed.

          Accidentally doing a Cisco command option on the other router is your fault

        2. Doctor Syntax Silver badge

          Re: redundancy and diversity?

          Best go for three, then. Two identical ones to fail between normally, and a different one to fail over to when the common-mode flaw hits. Or four, so that there are two redundant pairs.

        3. Cessquill

          Re: redundancy and diversity?

          Unless this is something I've misremembered or just hogwash (and may no longer be the case)...

          NASA has redundancy from different companies, so they can't both fail for the same reason.

          The European Space Agency has redundancies from the same supplier, so when there's an inherent problem in one, both fail.

          A cross-street network going down is nowhere near as bad as a code error borking a space mission though, so it'll probably be fine.

  6. tip pc Silver badge
    Mushroom

    6509 was a chassis switch not a router

    The Cisco 6509 was a chassis switch: yes, it was L3, and you could also install a firewall module, redundant supervisors, etc etc.

    A proper beast, with its woeful blocking backplane as its Achilles heel.

    https://www.cisco.com/c/en/us/support/switches/catalyst-6509-e-switch/model.html

    firewall module

    https://www.networkstraining.com/cisco-firewall-service-module-fwsm/

    The Cisco 7600s were the routers.

    Along with the old-school HP laser printers, the 6500 series would still be operational post-apocalypse.

    1. ShortLegs

      Re: 6509 was a chassis switch not a router

      Pretty sure the 6509 supervisor engine featured routing, so it could be used as a router

      IIRC it was the replacement for the CAT5xxx and RSM (router switch module), to perform inter-vlan routing and switching.

      1. tip pc Silver badge

        Re: 6509 was a chassis switch not a router

        Yes, the supervisor did do routing, but that just made the chassis an L3 switch, as mentioned.

        Cisco refers to them as switches

        https://www.cisco.com/c/en/us/support/switches/catalyst-6500-series-switches/series.html

        Adding a FWSM made it a firewall, but with limitations vs an ASA.

        https://community.cisco.com/t5/network-security/fwsm-vs-pix-vs-asa/td-p/734843

      2. Flightmode

        Re: 6509 was a chassis switch not a router

        The main difference between the 6500 and the 5500+RSM (apart from being different generations) was that the 6500 had an integrated RSP and was running IOS natively for BOTH the routing and switching portions. The 5500 (switch portion) ran CatOS, and you would log on to the RSM which ran IOS separately to configure the routing functionality. (Source: Have locked myself out of both models.)

        1. Anonymous Coward
          Anonymous Coward

          Re: 6509 was a chassis switch not a router

          I bet you guys are fun at parties

          1. Anonymous Coward
            Anonymous Coward

            Re: 6509 was a chassis switch not a router

            What's a party?

            1. tip pc Silver badge
              Facepalm

              Re: 6509 was a chassis switch not a router

              No LAN / MAN or WAN parties without us.

              1. The Oncoming Scorn Silver badge
                Thumb Up

                Re: 6509 was a chassis switch not a router

                Many respectable network engineers said that they weren’t going to stand for that sort of thing, partly because it was a debasement of protocol, but mostly because they didn’t get invited to those sorts of parties.

                1. Hazmoid

                  Re: 6509 was a chassis switch not a router

                  have an upvote for obligatory "Hitchhikers" reference :)

          2. Anonymous IV

            Re: 6509 was a chassis switch not a router

            What's "fun"?!

      3. DougMac

        Re: 6509 was a chassis switch not a router

        Yes, you could put various L3 Sups in the 6509. Depending on your needs, and just how fat your wallet was, you could go anywhere from basic L2 to full L3 BGP Internet routing. And since the TCAM was small and fixed in size, as the Internet routing table kept growing astronomically your wallet would have to keep growing too, swapping out the Sup engine of the day just to keep up.

        FWIW: the 7600 designation was the same exact chassis/cards, but marketed by a different BU at Cisco.

        If one was an ISP, and had a fat enough wallet, they'd get the 7600. If you started out as Enterprise, you'd get the 6509. Same features, options, Sup's available. Just a different badge on the front, and different sales team on the backend talking to you at Cisco.

        Cisco would laugh all the way to the bank either way.

  7. tip pc Silver badge
    Pint

    reload in, or its vendor equivalent

    When the stuff you are managing is remote, you don't get the luxury of nipping across the road to fix it.

    Before doing anything, save the config.

    Next, do a reload in 5.

    When the inevitable happens, you just wait for the box to reload into its saved config.

    Juniper has its commit confirmed equivalent, which just reverts the config instead of reloading the box.
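
    A rough host-side sketch of the same dead-man's-switch idea, assuming a Debian-style /etc/network/interfaces and at(1) for scheduling – on the routers themselves you'd use the vendor's own reload in / commit confirmed as above:

      # Schedule an automatic rollback before touching anything. at(1) jobs survive
      # your SSH session dropping, which is exactly the failure being guarded against.
      # The paths and the five-minute window are illustrative only.
      cp /etc/network/interfaces /etc/network/interfaces.bak

      echo 'cp /etc/network/interfaces.bak /etc/network/interfaces && systemctl restart networking' \
          | at now + 5 minutes

      # ...apply the risky change, then confirm you can still reach the box...

      # Still in? Cancel the pending rollback (atq lists the job number).
      atq
      # atrm <job number>

    On the Cisco itself, reload cancel is the equivalent "I'm still in, stand down" step.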

    1. Yet Another Anonymous coward Silver badge

      Re: reload in, or its vendor equivalent

      Cisco has an arm that comes out and presses the reset button - like those joke boxes that press their own off switch.

  8. Huw L-D

    Clint (in all caps) can be used to get past basic word filters. Just sayin'

    1. Victor Ludorum

      Reminds me of

      the birthday cake.

  9. GeekyOldFart

    Ah, the days when you had 40 terminal windows stacked around two 17" CRTs... and being an old-school *nix guy it was focus-follows-mouse, not click-to-focus.

    The paranoia over accidentally sending the command to the wrong system was legendary. I spent several hours writing scripts and terminal config files so that dev systems were green themed, QA/test orange and production red. Used various kludges and hacks to get them to pick up when I had a root shell open and invert their colors too.

    One of my colleagues thought I was being overly silly, and I admitted that I mostly did it because I wanted to find out if I could, and if it made a difference (and if I could make it look cool into the bargain, compared to the "vanilla Microsoft" look of the desktops belonging to people NOT admitted to the inner circles of systems administration and therefore not provided with a "real workstation" at their desks)...

    Until the day that same colleague sent a shutdown command into prod rather than dev and asked me for copies of how I did it.
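
    For anyone wanting to try the same trick, a bare-bones bash sketch (hostname patterns and colours here are just examples – the real thing themed the whole terminal, not just the prompt):

      # ~/.bashrc sketch: colour the prompt by environment, guessed from the
      # hostname, and invert the colours for root shells.
      case "$(hostname -s)" in
          *prod*)      colour='1;31' ;;  # production: red
          *qa*|*test*) colour='1;33' ;;  # QA/test: yellow-orange
          *)           colour='1;32' ;;  # dev and everything else: green
      esac
      if [ "$(id -u)" -eq 0 ]; then
          PS1="\[\e[7;${colour}m\]\u@\h:\w#\[\e[0m\] "   # reverse video for root
      else
          PS1="\[\e[${colour}m\]\u@\h:\w\$\[\e[0m\] "
      fi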

  10. Giles C Silver badge

    Every network admin has done this

    And the ones who say they haven't are either lying or have never worked on a production system.

    It is a standard interview question round here, we all know you will have done something stupid but how you behave after doing said stupid thing is what is important.

    Personally, I think honesty is the best approach: own up before someone finds out it was you and that you tried to hide it.

    1. ShortLegs

      Re: Every network admin has done this

      "It is a standard interview question round here, we all know you will have done something stupid but how you behave after doing said stupid thing is what is important."

      And just as important is the reply a candidate gives. Same league as "what's the biggest mistake you ever made?".

      If a candidate replies that they have never inadvertently shut down the wrong server / device interface, I'm instantly suspicious. Likewise, if they claim to have never made a mistake, they are either fibbing, or I wonder how they would respond when they do make the inevitable mistake.

      1. Jellied Eel Silver badge

        Re: Every network admin has done this

        If a candidate replies that they have never inadvertently shut down the wrong server / device interface, I'm instantly suspicious. Likewise, if they claim to have never made a mistake, they are either fibbing, or I wonder how they would respond when they do make the inevitable mistake.

        Also, Monte Carlo modelling (OK, more the fallacy) would suggest that if they haven't made a mistake like that yet, they're overdue. But it's a question I often ask candidates because it opens up a few things, like how they dealt with the mistake and maybe their honesty. Plus any lessons learned and actions to avoid repeats. Safety rules are written in spilled blood and bits.

        But been there, done that, and that's why I'm a big fan of OOB access, terminal servers and as many ways as possible to get back into a box that's been accidentally borked. Especially when said boxen may be multiple time zones away. Memorable highlights have been typing 'debug all' on a rather overloaded Cisco running peering and transit for a reasonably sized chunk of the UK. Oops. Or discovering that Livingstone portmasters connected to Suns sent a break to said Suns' console ports when they powered up. That one was the sysadmin's fault for not telling us they did this, and for deciding they could sleep while we netengs did some upgrades that meant we had to power off & move their Livingstones. Or just reconfiguring things like IGPs, spanning tree or, of course, BGP. There are so many ways a network can bite the hands that feed it.

        1. PB90210 Silver badge

          Re: Every network admin has done this

          I tried telling them it was silly to have dial-up modem access to the console port of a router stuck on the same line as the ADSL port of said router... did they listen...

      2. Doctor Syntax Silver badge

        Re: Every network admin has done this

        "I wonder how they would respond when they do make the inevitable mistake"

        Let them learn by their mistakes at somebody else's expense.

        1. Anonymous Custard Silver badge
          Headmaster

          Re: Every network admin has done this

            As the saying goes, the better engineer learns from their own mistakes, but the best engineer learns from other people's mistakes.

          1. Notas Badoff

            Re: Every network admin has done this

            “Only a fool learns from his own mistakes. The wise man learns from the mistakes of others.”

            ― Otto von Bismarck

      3. Giles C Silver badge

        Re: Every network admin has done this

        And in case you are wondering I have in my career….

        Powered off a comms room because I caught the emergency bypass switch on a UPS (the switch stuck out of the case by 1mm).

        Moved a server on wheels and pressed the power button (a very old server where the power dropped as soon as the button was released – that spring was very strong).

        Missed an option on a Cisco debug command and overloaded the router.

        Trashed the entire VLAN index on every switch at a site with 80+ switches – told the boss there was a problem I had caused and that I would fix it. It took 12 hours to fix.

        And one of the best was driving home from an overnight job at a remote site when I got a call from a colleague along the lines of "I turned spanning tree off on a port and nothing works in the main data centre". To which the answer was: well, you are going to have to turn off one of the cores, probably both, so the loop disappears, wait 10 minutes and then power it back up, and make sure you disconnect the port you changed. P.S. go and tell the bosses that you screwed up first. I got asked "do I need to tell the bosses?", to which the answer was simply yes – if they don't know about the issue now, then they will when all the servers go down due to the loop in the network, and it is better they find out sooner than having to come searching for answers.

  11. s. pam
    FAIL

    Reminds me of my first FDDI testing

    when I learned the acronym really means Fall Down Drug Inducing, after I tested between 2 buildings and didn't realise the fibre vendor had mis-built the connectors on one end. The server was live, but the disk network was dead... epic fail on our part!

    1. Yet Another Anonymous coward Silver badge

      Re: Reminds me of my first FDDI testing

      It's always connectors, except when it's power supplies

      1. Doctor Syntax Silver badge

        Re: Reminds me of my first FDDI testing

        Except when it's DNS.

        1. Peter Gathercole Silver badge

          Re: Reminds me of my first FDDI testing

          Come on. It's always DNS. Even when it isn't!

  12. david 12 Silver badge

    Just then the head of the IT team came in looking

    He knew when he found Clint. He was just too polite to make a point of it, and glad to go along with the myth that it was a hardware fault, not one of his staff.

  13. Anonymous Coward
    Anonymous Coward

    busy hands

    What you soon realised, back in the past: over Xmas and other long holiday periods the networks everywhere are super stable, as nobody is onsite anywhere twatting around.

    I'm not sure how WFH has affected things these last few years, as everyone is probably eager to show they are doing something.

    1. Yet Another Anonymous coward Silver badge

      Re: busy hands

      Apparently you're also much safer having a heart attack.

      The junior doc on duty is less likely to do something "clever" and just let you get better.

      1. Anonymous Custard Silver badge
        Boffin

        Re: busy hands

        From bitter experience around here, there's nothing more likely to lead to unplanned downtime than preventative maintenance...

    2. Alan Brown Silver badge

      Re: busy hands

      During the BT strikes of the 1980s they found that internal exchange faults dropped by over 90%

      That reshaped maintenance policies of telcos worldwide

  14. Chris Miller

    In my experience (strokes long grey beard) much the most common cause of failure in resilient systems is a single failure (which causes no operational problems) followed by an attempt to repair the working, not failed, device. It never happened to me (honest), but I've heard some tales ...

  15. STOP_FORTH Silver badge
    WTF?

    Making things worse

    I worked for a medium sized company (Medco Ltd).

    We were taken over by a very big company (Jumboco)

    Jumboco's HQ and labs were in a different country.

    Jumboco insisted that our nightly builds must be done on their mighty compilers in their labs. They could then be released to our greatful customers as required.

    We only had one fat pipe to the outside world; this was obviously a single point of failure and could not be countenanced by a big important company.

    Pro tip:- If you are going to dig a long trench alongside the existing cable for the new cable ensure that a) this work is performed on a Friday afternoon and b) a very important software upgrade is being released to one of your most important customers on Saturday.

    1. STOP_FORTH Silver badge

      Re: Making things worse

      Um, grateful. Apologies.

      1. Martin an gof Silver badge

        Re: Making things worse

        Silly rhyme I learned many years ago from my grandfather, and later saw in a book of puzzles for children (I think I've remembered it correctly, I mean, we're talking 1970s here, my grandfather probably learned it when he were a lad in the 1910s).

        If your grate BMT put: If your grate B. putting:

        (read out the punctuation)

        M.

        1. Anonymous Coward
          Anonymous Coward

          Re: Making things worse

          If the grate be (great B) empty, put coal on (or maybe in?) (colon); if the grate be full stop putting coal in (on?).

          This kind of wordplay is making me feel ill.

          1. The commentard formerly known as Mister_C Silver badge

            Re: Making things worse

            YY U R

            YY U B

            I C U R

            YY 4 me

            1. tfewster

              Re: Making things worse

              Spoiler alert: Read "YY" as "two Y's"

        2. STOP_FORTH Silver badge
          Happy

          Re: Making things worse

          I must admit, I had to read this a few times before I understood it. (Native English speaker here!)

          It looks a lot like code.

          Where is the "Else"?

          What language is it?

          1. Martin an gof Silver badge

            Re: Making things worse

            Yes, first time I have written it down since probably the 1980s and I began having flashbacks to Pascal at uni...

            M.

        3. swm

          Re: Making things worse

          If your grate BMT put: If your grate B. putting:

          I learned this from my grandfather too.

    2. tip pc Silver badge

      Re: Making things worse

      Pro tip:- If you are going to dig a long trench alongside the existing cable for the new cable ensure that a) this work is performed on a Friday afternoon and b) a very important software upgrade is being released to one of your most important customers on Saturday.

      That's where you insist on diverse routing – where the second link shares no common path with the first: coming in at a different corner of the building, going to a different telco exchange, and preferably with a different service provider (e.g. BT for the first link, Vodafone for the second).

      1. Yet Another Anonymous coward Silver badge

        Re: Making things worse

        And the Telco says certainly - and then leases a line from the first lot.

        1. M.V. Lipvig Silver badge

          Re: Making things worse

          You wouldn't believe how many geographically diverse circuits travel the same path end to end - some of those install ticks, I mean techs, think that it's good enough if one is on the 101/carrier/pointa/pointb and the other is on the 102/carrier/pointa/pointb. Then guess who gets yelled at when they both drop due to a fiber cut? At least I've gotten good enough at handling irate customers that the next time there's an issue they remember who I am and suddenly they're no longer mad.

          1. STOP_FORTH Silver badge
            Big Brother

            Re: Making things worse

            I can remember (more senior) colleagues discussing this in the 1980s. Presentation to us might be separate copper pairs going in different directions to different exchanges at each end of the circuit. There was no guarantee that they wouldn't be multiplexed together on some PDH/SDH link somewhere in the middle.

            These were mostly "music" circuits rented from BT, plus some of that new-fangled data that would supposedly replace our venerable 75 baud telegraph.

            1. Alan Brown Silver badge

              Re: Making things worse

              "There was no guarantee that they wouldn't be multiplexed together on some PDH/SDH link somewhere in the middle."

              I made a point of putting in hefty penalty clauses for if this is found to be the case.

              It makes them double-check.

              The funny part is when they backpedal and refuse to allow such clauses – if it never happens, why are they afraid of it being called out in the contract?

      2. STOP_FORTH Silver badge
        FAIL

        Re: Making things worse

        Yabbut, the site was surrounded by private land; the only way out was the drive. Unless you wanted to start negotiating for wayleaves with neighbouring farmers.

        The real problem was that they took the compiling hardware plus the associated authentication/digital signature box. Without the sig/hash/whatever, the code wouldn't run on our proprietary hardware.

        Had they left things alone we could have couriered the signed code to another site or Internet access point.

        But they had to make things better.

        Eejits.

        1. Doctor Syntax Silver badge

          Re: Making things worse

          And did they learn anything from that?

          OK, I think I can guess.

          1. STOP_FORTH Silver badge
            Happy

            Re: Making things worse

            We were about 2-3% of their workforce. Tails don't wag dogs.

            Their e-mail storms were a thing to behold.

            Outsourced IT is obviously better!

      3. anothercynic Silver badge

        Re: Making things worse

        Yes, classic that. Happened to a former employer... they needed/wanted power redundancy. It turns out that while the power and data cables were routed differently (so some errant digger could've dug one lot up and we'd still be OK), the power did end up being served by the same substation. Substation goes pop, pop goes the power on both ingresses (and no power means no data). Oops.

        Thankfully, current employer does have true redundancy (for power and data) on the data centres, although I am still not particularly happy having to park some critical path stuff behind firewalls (not my call) rather than switch ACLs (for our particular use case we only need a handful of ports exposed to $planet, but they must *never* go down). Redundancy is fab, until the big great firewall that was put across both ingresses has a bad moment, and you're not feeling the redundancy (yay, servers are up, but no traffic whatsoever flows, which is pointless).

        *shrug* Not my fault, just my head on the platter for the customers. So that's effectively in our risk register as something someone else needs to carry the can for.

      4. Sparkypatrick

        Re: Making things worse

        Using a second provider seems like a good idea, but I've seen cases where it turned out that somewhere along the way, both providers were using the same trench someone just cut through with a digger. Many providers will offer the option of diverse connections where they ensure that no part of the connection runs in the same physical path or on the same part of the backbone. At a price, obvs.

  16. Peter Christy

    Triple redundancy?

    Reminds me of my time many decades ago working for a major broadcaster in the UK. The Videotape department was the hub of operations, and very little TV was actually "live" - even back then. The quadruplex VTRs required an air pressure feed for the air bearings in the video heads - these things were doing around 15,000 rpm and were servo controlled to incredible tolerances, even by today's standards.

    To service the 40-odd VTRs in the basement there were three compressors feeding the air lines. The whole area could run on two, and over half could run on just one. Triple redundancy, you might think. So when I turned up for my shift one morning and found the whole area at a standstill, my initial thought was that a strike had been called – this was the early 70s, after all!

    But, no! The compressed air was down! How, you might ask? Well, the air compressors were water cooled, and the council had been digging up the road outside and fractured the main water supply into the building! No water, no compressed air, no VTRs!

    As supplied by the manufacturers, the VTRs came with their own small, but very noisy, compressors. These had been removed once the central air supply had been fitted. Fortunately, a few remained stowed away in a cupboard, somewhere, and these were frantically re-installed on three or four machines so that transmission at least could be restored.

    Aside from the thrashing of these little compressors, the department was very quiet that day.....!

    Lesson learned: Triple redundancy doesn't help when you have a single point of failure!

    1. Yet Another Anonymous coward Silver badge

      Re: Triple redundancy?

      See also:

      Checking that the diesel tanks for your backup generator have diesel in them.

      Checking it's not biodiesel that turned to slime years ago.

      Checking that the pumps to get the fuel to the backup generators are on backup power.

      1. Doctor Syntax Silver badge

        Re: Triple redundancy?

        If it's winter, check that it's not summer-spec diesel.

        1. Bitsminer Silver badge

          Re: Triple redundancy?

          And nobody "borrowed" the battery from the genset to fix their car.

          1. Anonymous Custard Silver badge
            Trollface

            Re: Triple redundancy?

            And what percentage of the diesel is actually rainwater that may have leaked into the system...

      2. Dan White

        Re: Triple redundancy?

        Many years back I worked for M&S and got transferred to a store in a new out of town retail park, which featured a Sainsbury's at the opposite end of the park and the obligatory Sainsbury's petrol station.

        M&S were (at the time) known for having a procedure for *everything* that could go wrong, and a couple of months after opening it got put to the test. There was a huge amount of construction work still going on around us, and some genius managed to sever the buried 11kV line to the retail park substation.

        In M&S, the "Oh f**k we've lost power" automated sequence kicked in. Pelmet lights around the store went out, HVAC went into "Ultra-Eco mode", and crucially, the backup generator kicked in nicely, whilst also pinging Head Office to request a replacement diesel delivery ASAP.

        20 minutes later, a poor sod from Sainsbury's turned up with a 10 litre jerry can and asked if they could "borrow" some diesel for their generator as it was running out.

        "But you've got your own petrol station!?!"

        "Yeah, they didn't connect it to the generator."

        "Ah. Sucks to be you..."

        1. David Hicklin Silver badge

          Re: Triple redundancy?

          > a poor sod from Sainsbury's turned up with a 10 litre jerry can and asked if they could "borrow" some diesel for their generator as it was running out.

          The company I worked for in the late 70s/early 80s used to supply the generators for Sainsbury's; they had an 8-hour tank (if they bothered to keep it filled up).

      3. Gene Cash Silver badge

        Re: Triple redundancy?

        Check that the tank vent isn't clogged so the generator can actually draw fuel and not stop when the tank develops a vacuum.

      4. Alan Brown Silver badge

        Re: Triple redundancy?

        Checking that the backup generator starter motor battery is being correctly float charged and that the batteries actually work

        (Happened at one site – the person in charge brushed off my questions on that specific point two years prior to the outage which revealed the issue.)

    2. Prst. V.Jeltz Silver badge

      Re: Triple redundancy?

      So they needed a permanent supply of mains water for cooling?

      Sounds wasteful.

      1. a_builder

        Re: Triple redundancy?

        It was actually very common to have cooling water run to waste in the pre-water-meter days.

        A lot of big science labs had it.

        It caused a lot of issues in the late 90s when all that was banned – rightly – and chilled water became the thing.

    3. collinsl Silver badge

      Re: Triple redundancy?

      Did these VTRs also get flooded from the fountain above them? If so then that sounded like it would have been a great emergency water repository for cooling the compressors

  17. disgruntled yank Silver badge

    Halcyon days

    Has somebody on TV started using the expression? I hadn't seen it for a while, but this weekend it appeared in print in The Washington Post. I would remark that in its original use, halcyon days meant good weather around the winter solstice.

  18. biddibiddibiddibiddi Bronze badge

    Not sure I believe this one. If he was logged into the router in the building he was in, why did he have to go across the street to console into the router there to restore the connection? Even if they were in a true failover configuration with a single namespace for each pair, it would be obvious which side he was on and which interface was which. If he saw an interface that was in a fault state, why would he issue a command that would take down an interface that was NOT faulted? (I.e., if Serial0/1/0 is down, why would he issue a command to shutdown Serial0/0/0?)

    1. M.V. Lipvig Silver badge
      Trollface

      If you had read the story, you would have spotted that he made a mistake and misread what he was working on. It happens, like just now when you read the story and missed the point.

      1. biddibiddibiddibiddi Bronze badge

        "He was logged into the wrong router – the link to the other building was obviously down, so he'd been shuffled over to the router in the building he was in, without realizing it." -- TO THE ROUTER IN THE BUILDING HE WAS IN.

  19. biddibiddibiddibiddi Bronze badge

    In the early 2000s, I worked for a large ISP on the US east coast. We worked out of an office in Massachusetts. One of the engineers turned off an interface on a router in Virginia which brought down half of our backbone, and for whatever reason there was no management console remote access. Luckily we had a partner or something in the area who was able to drive there and just reboot the whole router, but it was still down for a couple of hours.

    There was also a time when something like that happened, and an engineer had to get on the next flight from MA to another state to reboot a router.

    In another instance, we started broadcasting routes over BGP for basically the entire IPv4 address space, which our peers accepted, so traffic for the entire Internet began coming to us and dying, causing outages for multiple other providers' clients on the entire eastern half of the US. Outages that take down large numbers of websites and services are sort of commonplace these days with cloud providers that go down, but that didn't happen often back then. We took it as a badge of honor that we were able to cause that much of an interruption and even made what would now be called a meme that we printed and posted in the office. (Something like "Can your ISP bring down the entire Internet? We can.")

  20. Niek Jongerius

    *sigh* - Another one of those sad people who don't have the balls to own up when they make a mistake.

    1. Robert Carnegie Silver badge

      Yeah, that's what this story category is for. https://www.theregister.com/Tag/who-me

      "Who, Me? is a weekly column in which our readers confess to catastrophes they created in the pursuit of IT excellence - and usually managed to get away with.

      "The column is a light-hearted look at the world of work and tech."

      So, "usually managed to get away with".

  21. wyatt

    Reminds me of when I was being shown round my first contracting job. Went into the 'comms/server' room and saw that 2 of the 3 ISDN BRI boxes had no lights on them. Guess the backup isn't working then!

    1. John Brown (no body) Silver badge

      I was once called out to be "a pair of hands" in a data centre to replace an HDD in a server for a major online payment processor. It was "absolutely vital" that this job be attended within a specified, fairly short time frame, as the RAID array of five disks "only" had 2 hot spares. In other words, not really all that time critical! So I'm on site, chatting with their server guy on the other end of the phone, having already identified the failed HDD from the blinkenlights, while he talks me through the process and says he's going to flash the LEDs. Fair enough, he doesn't know me or my skill level. Job done. So I mention that there are two PSUs for redundancy, and only one has blinkenlights on. A short moment of silence, followed by obvious clicky-clacky keyboard sounds and "oh shit, the monitoring alert hasn't been set up for that!" So the org was in a "panic" over a failed HDD with something in the order of 3 levels of redundancy (RAID inc. parity, + two hot spares), while the real, more urgent issue was two redundant PSUs, one of which, according to the logs, had been dead for 3 months.

      Sometimes it's not just a case of monitoring stuff, it's also about checking that the stuff that needs to be monitored and/or have automated alerts is actually set up to monitor and alert :-)

  22. ShameElevator

    Damn remote desktops

    The year must have been just around 2000. I was still leaning but one task was customer support. We hosted several customers' "on prem" servers. One of these was an Exchange Server. At the end of the day I had to do some work on it, so I remoted into it with Remote Desktop. I did most of the work in full screen – low budgets meant crappy computers with crappy monitors, meaning not a lot of space to work in if things were in a window. Anyway, I'm done with my work, talk with a colleague and shut down my computer, but a few seconds later I see my real desktop. My first thought was "That was odd", but then I realised what I had done. I also realised the pop-up I had just clicked OK to when I shut down the server was Windows telling me that I was on a remote server and asking if I was really sure I wanted to shut it down.

    I called the customer right away and told them of my blunder, then ran down to the server room in the basement, where the server was still shutting down…

    From that day I leaned to take extra care and attention when doing remote work. Always double check before doing potentially disastrous actions.

    1. John Brown (no body) Silver badge
      Joke

      Re: Damn remote desktops

      "The year must have been just around 2000. I was still leaning"

      Still sobering up after the millennium celebrations?

      (Sorry, I know it's a typo, but well... :-))

    2. John Brown (no body) Silver badge

      Re: Damn remote desktops

      "From that day I leaned to take extra care and attention when doing remote work. Always double check before doing potentially disastrous actions."

      MS can shoulder some of the blame for making everyone "click happy", with so many unnecessary pop-ups with "Are you sure?" on them effectively training people to auto-click without reading.

  23. C R Mudgeon
    Facepalm

    Single point of failure

    Sometimes it's the human in the loop...
