Health system network turned out to be a house of cards – Cisco cards, that is

Welcome once again, gentle reader, for that cushion into the working week that we like to call Who, Me? in which readers like your good selves entertain us with tales of times technology did not go quite right. This week our hero is someone we shall Regomize as "Tim" who moved continents some years ago, bringing his networking …

  1. Korev Silver badge
    Coat

    when the dual supervisor card was installed, it froze up harder than spit in a blizzard.

    So snow one could do any work?

    1. aerogems Silver badge
      Coat

      The network was put on ice.

      1. bemusedHorseman
        Black Helicopters

        "Jack in! BlizzardMan.EXE, Execute!"

    2. jake Silver badge

      Caused an avalanche of calls from people who thought they had been given the cold shoulder.

      1. Anonymous Coward
        Anonymous Coward

        They were certainly piste off.

    3. UCAP Silver badge

      Someone needed to just chill out.

    4. Anonymous Coward
      Anonymous Coward

      Any old <network> port in a <packet> storm?

      1. Korev Silver badge
        Coat

        I bet they have their data in Snowflake...

  2. aerogems Silver badge
    Holmes

    For those of us blessed to be ignorant of the gear in question, was it a defect in all of those 6500 devices, or just that specific unit? Because if it's the latter, I'm left wondering why the entire network needed to be redesigned as opposed to logging a warranty claim to get Cisco to send out a replacement.

    1. UCAP Silver badge

      Either (a) they were out of warranty, (b) there was no support contract with Cisco (I've run headlong into that issue once or twice), or (c) both of the above.

    2. b0llchit Silver badge
      Boffin

      If one fault can bring down an entire network, you need to resolve that SPOF.

      In other words, redesign.

      1. aerogems Silver badge

        Sure, in an ideal world where you're working with infinite resources. However, in most cases compromises are made for one reason or another, and there are usually multiple single points of failure in any given system. Usually the best-case scenario is that you can limit the number of critical points that can bring the entire system down. Just for one example, what if the power goes out for an extended period of time? You know, a major substation suffers a catastrophic failure, and it'll be several days before the power company can get a replacement part in to fix it. The whole network goes down, and it's not like you can just plug everything into a different power grid.

        What if you need to connect two separate buildings half a kilometer apart? Rarely will you find someone sticking two independent routers at either end and running parallel cabling. Almost certainly not. Usually it'll be just one cable, or maybe a series of cables on a single switch. That cable and/or switch goes bad, and it's ruh-roh spaghetio.

        1. I could be a dog really Silver badge

          There's compromises, and there's putting all your eggs in one basket as seems to be the case here.

          True, assuming this was some time ago before it was easy to just configure multiple VPNs across the internet to mesh things, it would be costly to dual-home each remote site - but it would certainly have made sense to duplicate some links so losing the hospital hub wouldn't take out the entire network for everyone. Or even split remote sites across two hubs. Many options to reduce the impact of losing the hub.

        2. kwlf

          In answer to your first point, the hospital will have a backup generator, so the national grid being down for 6 days shouldn't be a problem.

          Healthcare IT failures are potentially life threatening - I can think of a few deaths they have contributed to - so their infrastructure should be more robust than your average online merchant. Something to aspire to.

    3. Jellied Eel Silver badge

      ..was it a defect in all of those 6500 devices, or just that specific unit?

      Probably. The good ol' Cat6500 had a number of quirks and features. One being (from memory) that its bus/backplane was built around good ol' PCI (https://en.wikipedia.org/wiki/Peripheral_Component_Interconnect): a 33.33 MHz clock with synchronous transfers, for a peak transfer rate of 133 MB/s at 32-bit bus width (33.33 MHz × 32 bits ÷ 8 bits/byte = 133 MB/s). And it was half-duplex. Not to mention the sup engine essentially being a very overpriced PC on a board. And then-

      The primary points on the network were two datacenters and a server room in a hospital, which was the hub for the rest of the network.

      And the rest, as they say, is history. So basically in that kind of configuration you have to be very, very careful not to overload the backplane, or do much that relied on traffic being processed by the sup engines. Although you could cram hundreds of ports into the chassis, the backplane and sup engines couldn't really handle much traffic. Cisco's always been a bit creative with its box sizing and marketing, especially when it comes to throughput. Or security on Webex <cough>.
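      (For the curious, a quick back-of-the-envelope in Python, taking those bus figures at face value; the chassis port counts below are invented for illustration:)

        # Peak rate of classic 32-bit / 33.33 MHz PCI, as quoted above
        clock_hz = 33.33e6
        bus_width_bits = 32
        peak_bytes_per_s = clock_hz * bus_width_bits / 8   # ~133 MB/s
        peak_bits_per_s = peak_bytes_per_s * 8             # ~1.07 Gbit/s, shared, half-duplex

        # Hypothetical fully-stuffed chassis: 4 line cards x 48 ports of 100 Mbit/s
        ports = 4 * 48
        aggregate_demand_bps = ports * 100e6               # 19.2 Gbit/s worst case

        print(f"Backplane peak: {peak_bits_per_s / 1e9:.2f} Gbit/s")
        print(f"Worst-case port demand: {aggregate_demand_bps / 1e9:.1f} Gbit/s")
        print(f"Oversubscription: {aggregate_demand_bps / peak_bits_per_s:.0f}:1")

      Hundreds of ports contending for roughly a gigabit of shared, half-duplex bus: hence all the very careful cat herding.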

      But one fun thing you used to be able to do with Cats was to load a regular IOS image onto the sup engine and start running BGP. Sadly Cisco got wise to that one and stopped us using big Cats as cheap routers. But this kind of issue was a pretty common problem with people throwing more ports into 'hubs', then wondering why their Cats fell over. Cat herding was one of those things you weren't taught at Cisco's certification boot camps, but usually had to learn the hard way.

      When you hit that kind of wall though, the only real solution is to redesign the network, segment it better, and hope the bean counters sign off on the tin. On the plus side, it's also a good opportunity to improve the security.

      1. Rod.h

        So in this situation, would adding more switches have helped, or would it need a different device, i.e. a router, to resolve it?

        1. Jellied Eel Silver badge

          So in this situation, would adding more switches have helped, or would it need a different device, i.e. a router, to resolve it?

          Kind of. Segmenting networks so users are on their own switch can be good for both performance and security. Whether, or where, to insert routers is more complex, especially given their cost compared to switches. It usually means doing some traffic analysis to figure out where traffic flows are going, and how to optimise those. So if a department's mostly talking amongst themselves and has their own file & print servers, put them on their own switch to keep that traffic local. Or double up switches for resilience. The issue with the Cats was really the limited backplane capacity, and the performance of the sup engines if things like filtering etc needed to be applied. Also not an issue unique to Cisco, e.g. the original Juniper BFRs were basically ATM switches with a *nix box running Junos and only a 100 Mbps Ethernet between brain and backplane.
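          (A toy version of that traffic analysis in Python; the department names and byte counts are entirely made up, just to show the shape of the decision:)

            # Which departments keep their traffic local? Those are the best
            # candidates for their own switch, off the hub's backplane.
            from collections import defaultdict

            flows = [
                # (src_dept, dst_dept, bytes) -- invented figures
                ("radiology", "radiology", 9_000_000),
                ("radiology", "datacentre", 1_000_000),
                ("admin", "admin", 2_000_000),
                ("admin", "datacentre", 6_000_000),
            ]

            local = defaultdict(int)
            total = defaultdict(int)
            for src, dst, nbytes in flows:
                total[src] += nbytes
                if src == dst:
                    local[src] += nbytes

            for dept in total:
                print(f"{dept}: {100 * local[dept] / total[dept]:.0f}% of traffic stays local")
            # radiology: 90% local -> its own switch keeps that off the hub
            # admin: 25% local -> segmentation alone won't help; look at the uplink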

          I must go find the 6509 I acquired yonks ago. ISTR there were some other fun issues, like a split-backplane version where some slots were PCI-33 and others were 66 MHz, so you needed to be careful about which cards went in which slots, which wasn't always obvious.

  3. Anonymous Anti-ANC South African Coward Silver badge
    Trollface

    Planned obsolescence...

    1. Mark 85

      Planned obsolescence = unplanned mayhem.

  4. TheOtherPhil
    Coat

    This week our hero is someone we shall Regomize as "Tim" who moved continents some years ago

    I was always taught that was plate tectonics...

    (Mine's the mantle with the convection currents in the pocket.)

    1. Anonymous Coward
      Anonymous Coward

      I'm still amazed that nobody has come up with "There are some who call him ... Tim."

      1. C R Mudgeon

        Indeed. I pictured him pushing them around by casting fireballs at them.

        1. TangoDelta72
          WTF?

          Very Cleese-ey response

          Who are you who can summon fire without flint or tinder?

  5. Sam not the Viking Silver badge
    Pint

    A Bit of Flanders and Swann

    Brilliant strategy and outcome:

    1. New 'expert' goes in and updates system.

    2. System failure.

    3. 'Expert' intervention. Saves the day. ---->

    4. System update required.

    5. 'Expert' plans system update.

    6. Go to 1.....

    Appropriately, it reminds me of 'The Gas Man Cometh': 'It was on the Monday morning, the gas man came to call.'

    1. Prst. V.Jeltz Silver badge

      Re: A Bit of Flanders and Swann

      I did wonder how he negotiated getting funding / permission when he "started on a project to redesign the network and replace every piece of network equipment."

      especially after the auspicious start.

      1. Pascal Monett Silver badge

        Well, there's nothing like proof that something needs to be done...

    2. Mooseman

      Re: A Bit of Flanders and Swann

      "It was on the Monday morning, the gas man came to call"

      On Saturdays and Sundays, they do no work at all....

  6. Anonymous Coward
    Anonymous Coward

    Catalyst 6500 story

    Years ago, I would assist the network guy with a simple task: inserting a new 24-port card into a Cat 6500.

    Simple, eh?

    Yes. Except when it all goes wrong.

    As it turned out, the 6500 had been recycled through multiple lives. The card slot power connector (male, on the chassis) was bent, the daughter card (female) was slipped onto it, and... both really didn't like it.

    BANG! The whole switch was fried due to a power short. It took us half the day to replace it (fortunately we had a spare).

    Lesson learnt: buy kit, don't recycle it ad nauseam.

    1. Anonymous Coward
      Anonymous Coward

      Re: Catalyst 6500 story

      Wrong lesson.

      Right lesson: Don't use Cisco garbage.

      1. Prst. V.Jeltz Silver badge
        Coat

        Re: Catalyst 6500 story

        Right on! There's that many home broadband routers lying around, they should be put to use instead!

  7. Michael H.F. Wilkinson Silver badge

    If something cannot possibly go wrong ...

    it will.

    The more people depend on the things not going wrong, the harder it will.

    Murphy's Laws hard at work

  8. Anonymous Coward
    Anonymous Coward

    Heading off after completion of a task

    How many of the On-Call and Who Me? columns have the "hero" heading off to get a meal, or drink, or home to sleep, only to discover that something unexpected had gone wrong and the fan is turning brown?

    Unfortunately, encouraging techies to always stay for an hour or two to confirm there are no catastrophic issues after said task was completed is ineffectual. Only experience seems to instill that behaviour.

    And yes, been there, done that - probably too many times to admit - before I learnt that lesson.

    1. Korev Silver badge
      Pint

      Re: Heading off after completion of a task

      Unfortunately, encouraging techies to always stay for an hour or two to confirm there are no catastrophic issues after said task was completed is ineffectual. Only experience seems to instill that behaviour.

      Au contraire... As a techie going home triggers the fault, I'd send them for a beer as soon as the work is done so the issue is triggered before the great unwashed come into work again...

      1. Excused Boots Silver badge

        Re: Heading off after completion of a task

        Schrödinger's Hardware Failure.

        Only occurs if no techie is there to observe it.

        1. Bebu
          Big Brother

          Re: Heading off after completion of a task

          《Schrödinger's Hardware Failure. Only occurs if no techie is there to observe it.》

          While the techie is there it's a superposition of <failure| and <non-failure| states; as soon as the techie leaves the site, either Murphy or Sod slips in, under whose malign gaze it rapidly collapses into a pure <failure| state.

          The quantum Sod's law (or Murphy's.)

          1. Joe W Silver badge

            Re: Heading off after completion of a task

            I appreciate the use of bra-ket notation. What is missing is the operator that collapses the state... but the margin of this book is too narrow for me to write it down... (i.e. I'm too lazy)
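            (A margin-sized sketch of that operator, tongue firmly in cheek, and with kets where Bebu had bras:)

              \[ |\psi\rangle = \alpha\,|\text{works}\rangle + \beta\,|\text{fails}\rangle,
                 \qquad
                 \hat{P}_{\text{Sod}} = |\text{fails}\rangle\langle\text{fails}| \]

              The techie leaves the site and Sod measures:

              \[ |\psi\rangle \;\mapsto\; \frac{\hat{P}_{\text{Sod}}\,|\psi\rangle}{\lVert \hat{P}_{\text{Sod}}\,|\psi\rangle \rVert} = |\text{fails}\rangle
                 \quad \text{with observed probability } 1, \text{ the Born rule's } |\beta|^{2} \text{ notwithstanding.} \]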

        2. CrazyOldCatMan Silver badge

          Re: Heading off after completion of a task

          Schrödinger's Hardware Failure.

          Only occurs if no techie is there to observe it.

          Or if a cat wanders near the shiny and oh-so-expensive new kit..

          Which is why I buy 2nd hand servers. Immune to the cat-effect. Which, given that there are two cats currently sleeping on top of the acoustic server case, is just as well.

      2. Jellied Eel Silver badge

        Re: Heading off after completion of a task

        Au contraire... As a techie going home triggers the fault, I'd send them for a beer as soon as the work is done so the issue is triggered before the great unwashed come into work again...

        Very true. Network support is also a bit like flying an airplane. Runways have marks on them for the 'point of no return'. Outages work much the same way. You're almost home, looking forward to bed and the pager goes off. Then pretty much regardless of time of day, you're probably going to have to fight traffic to get back and fix it.

        1. druck Silver badge

          Re: Heading off after completion of a task

          Runways don't have any such marks on them; the "point of no return", or more accurately reaching V1 speed, differs for each type of aircraft, its take-off weight, and atmospheric conditions.

    2. ColinPa Silver badge

      Re: Heading off after completion of a task

      A team of us were sent to a bank in Asia to resolve a performance problem; I was an expert in one part of the solution.

      The team spotted many problems, and fixed them.

      On Friday afternoon about 3 pm, after we had presented to management etc., the local account rep was about to take us out for a well-earned beer, when a junior person timidly put his hand up and said "the original problem is still there". Most of the team were on the 6 pm flight; two of us were flying home on Sunday.

      The two of us took off our coats, got out our laptops, and worked the problem. We could see the problem, and after a couple of emails to the development lab in the US, they spotted that a number was misconfigured (the number of tasks was set to 1 instead of something like 10).

      We changed the value to 10 and the code in their test system went like a rocket.

      They had an emergency change control meeting at about 11 pm! And we came in the next morning at 0600 while they made the change.

      The change was successful, and the bank was very happy.

      I learned that you should always check the problem is solved >before< going off for a beer!

      1. Pascal Monett Silver badge

        I say kudos to the "junior person".

        Apparently, he was the only one to check that the job had had the expected outcome.

        Sorry, but it doesn't say much for all the "experts".

        That said, I learned that lesson the hard way as well . . .

      2. Munchausen's proxy
        Pint

        Re: Heading off after completion of a task

        I learned that you should always check the problem is solved >before< going off for a beer!

        Yes, you never fix THE bug, the best you can do is fix A bug (repeat as necessary).

    3. Stuart Castle Silver badge

      Re: Heading off after completion of a task

      The problem is a lot of systems seem to have some sort of technician detector. As soon as the tech leaves, the system fails. I've had that many a time. Even waiting behind for an hour or two doesn't always seem to help.

      1. Anonymous Coward
        Anonymous Coward

        Re: Heading off after completion of a task

        And the opposite.

        Quite often the fault will miraculously disappear the moment the techie sits down without touching a keyboard.

        Happened to me several times.

        1. Diogenes

          Re: Heading off after completion of a task

          I teach ICT. It's amazing how many 'faults' are solved by me just laying hands on the machine.

          1. VicMortimer Silver badge
            Pint

            Re: Heading off after completion of a task

            These days, I can solve a lot of them with a phone call.

            The act of me answering the phone has resolved the fault an insane number of times.

        2. CrazyOldCatMan Silver badge

          Re: Heading off after completion of a task

          Quite often the fault will miraculously disappear the moment the techie sits down without touching a keyboard.

          Happened to me several times.

          Likewise..

          "But it wasn't working 5 minutes ago!" has featured a fair bit in my long and inglorious support career..

    4. C R Mudgeon

      Re: Heading off after completion of a task

      In my experience, the go-live often takes place after far too much overtime, not just on the day but in the days/weeks/months leading up to it -- in which case, people can be tired and/or approaching burnout.

      Corollary: the willingness to stick around for a while is especially low -- at a moment when the risk of exhaustion-related mistakes makes the need to stick around even higher.

  9. Terry 6 Silver badge

    There might be a reason

    As a general rule, and I assume it's the same with network stuff, if an obvious or well-known job hasn't been done, before you crack on and do it you should investigate why it wasn't done.

    Because if there's a paper trail and you screw up by not following it, it's your fault.

    1. chivo243 Silver badge
      Facepalm

      Re: There might be a reason

      Yeah, I'm the guy that tried to tell them not to do it; smarter techs have tried and failed. While being quietly ignored, I might add...

  10. phuzz Silver badge
    Unhappy

    I had a similar "this should be fine, ohshit" moment, moving power connections on an HP blade enclosure. It could hold up to eight hot-swappable power supplies; this one had six, but IIRC could operate on as few as two with the small number of blades in this particular unit. I'd already moved the power lead of one PSU to a different UPS, so I wasn't expecting anything different when I pulled the kettle lead out of the next one down. Instead there was a click, and the entire blade enclosure powered down, taking with it several important servers. Cue my boss charging into the server room asking what I'd done.

    After some testing, it turned out that one of the PSUs seemed OK right up until it had to draw any significant load, whereupon it would completely fail. If the enclosure had chosen to spread the load onto a different PSU we'd never have noticed; it was just sheer chance it picked the bad one.

    1. Korev Silver badge
      Trollface

      I had one of those power supplies emit the classic blue smoke, which then caused the circuit breaker to crap itself. We then saw what gear was correctly cabled and what wasn't...

      My face when all my gear stayed up -->

    2. collinsl Silver badge

      You must have had it set in "efficiency" mode then - there are configuration options with the C7000 enclosure which allow you to choose how power is spread out. The "efficiency" option (I can't remember the actual names now, it's been years since I touched an enclosure) used as few PSUs as possible to bear the load, so that they operated with as little efficiency loss as possible, whereas there were other options to balance the load across X PSUs or all PSUs (can't remember exactly) in order to maintain as much redundancy as possible.
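      (A toy model of that trade-off in Python; the efficiency curve is invented, not HPE's, but it shows why fewer active PSUs can mean better efficiency at the cost of redundancy:)

        # Switch-mode PSUs waste proportionally more power when lightly loaded,
        # so concentrating the load on few supplies draws less from the wall.
        def efficiency(load_fraction):
            # Invented smooth curve: sags when lightly loaded
            return 0.93 - 0.25 * (1.0 - load_fraction) ** 2

        load_w, psu_capacity_w = 2400.0, 1200.0
        for active in (2, 4, 6):           # PSUs actually sharing the load
            frac = (load_w / active) / psu_capacity_w
            print(f"{active} PSUs at {frac:.0%} load each: "
                  f"~{load_w / efficiency(frac):,.0f} W drawn from the wall")
        # 2 PSUs: 100% load each, ~2,581 W -- efficient, but no headroom on the pair
        # 6 PSUs:  33% load each, ~2,931 W -- wasteful, but failures are graceful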

  11. Stuart Castle Silver badge

    This is sort of related..

    A few years ago, when I first started learning Mac administration, I was given an old Mac Pro and a copy of OS X Server to put on it. I had a lab full of Macs to manage, and so I installed a few admin tools on this Mac, using them to administer the lab Macs. They authenticated against Active Directory, so I didn't have to worry about access control.

    Part of my job was re-imaging the machines at least once a year (or as needed due to corruption/drive failure etc), and I'd set up a fairly reliable system that used the now sadly departed freeware "DeployStudio" to deploy the base OS, and Munki to deploy the applications (Munki provides a sort of internal app store, but can be set to install software automatically).

    This system was pretty much automatic, but did require that we run round with boot USBs and click a few buttons to start the deployment.

    So, looking for a way to simplify this, and having read about netbooting (for the Windows heads, netbooting is essentially PXE), I clicked the option to turn it on, disabling DHCP in the process (we had an existing DHCP server on the network, and I didn't want to put a second one on the network and risk serious network issues).

    Half an hour later, I got a visit from a friend. Our networks team had noted that there was network disruption, and noticed a new machine that appeared to be trying to be a DHCP server. They gave him the IP and told him to remove it.

    I showed him the setup, but while he agreed that the machine did not look like it was set up to serve IP addresses, he had to remove it from the network until we'd removed any trace of the DHCP server.

    The sad thing is, had I been given a little more time, and access to our IP database, I could have found a solution that would not have involved macOS setting up its own DHCP server, and would have allowed us to use Microsoft's deployment system for Windows, and NetBoot for the Macs.
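    (Incidentally, Apple's NetBoot discovery protocol, BSDP, rides on DHCP-format packets, which is presumably why the box tripped a rogue-DHCP alarm even without serving addresses. A minimal sketch of the kind of watcher a networks team might run, using scapy; the known-servers list is hypothetical:)

      # Flag DHCP OFFERs from anything that isn't a known DHCP server.
      from scapy.all import sniff, DHCP, IP

      KNOWN_SERVERS = {"10.0.0.1"}   # hypothetical legitimate server(s)

      def check(pkt):
          if pkt.haslayer(DHCP) and pkt.haslayer(IP):
              # DHCP message-type 2 = OFFER, i.e. something answering clients
              if ("message-type", 2) in pkt[DHCP].options:
                  if pkt[IP].src not in KNOWN_SERVERS:
                      print(f"Possible rogue DHCP server: {pkt[IP].src}")

      sniff(filter="udp and (port 67 or port 68)", prn=check, store=False)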

  12. -v(o.o)v-

    OIR...

    ...also known as Online Insertion and Reboot.

    The cards need to be pushed in/pulled out just right and it might work.

    P.S. Still rocking some 7600s here. They double as space heaters.

  13. myootnt

    A long time ago in a land far, far away...

    I learned, much as Tim learned, that when you have a working system that isn't quite as it should be, it was probably brought to working order by someone who knew what they were doing, and left that way because the proper implementation caused a problem and nobody wanted to pay to bring it up to snuff. Although, there should have been a pair of 6500s at the core. Who the heck builds a network that uses 6500s and doesn't deploy them in redundant fail-over pairs? They don't even mention carrier redundancy.

  14. Richard 111
    Pint

    Did Tim go to Hortons for his doughnut?

    Can't find a doughnut icon, so beer it is.
