Network died, hard, during company Christmas party, leaving lone techie to fix it

Welcome, gentle reader, to another instalment of Who, Me? in which we cushion your entry to the working week with tales of Reg readers having worse days than you. So kick off your shoes and socks, make fists with your toes, and read on. This week meet a reader we'll Regomize as "Roy" who was contracted to a very large …

  1. diver_dave

    I can just see...

    An office chair with a note saying "now I have a network analyser" taped to it

    1. jmch Silver badge
      Happy

      Re: I can just see...

      ...ho ho ho

  2. b0llchit Silver badge

    The marvels of VLANs on Cisco switches. Still having nightmares about switches deleting their entire VLAN database and entering all ports in VLAN1.

    1. Ozan

      I have nightmares where I work in IT instead of construction.

  3. Pascal Monett Silver badge

    "for VTP to work, the command had to be sent to each of the switches"

    So he was operating on faulty assumptions. I don't want to be harsh, but it seems that an admin should always check that what he's doing is the right thing, otherwise mayhem might ensue.

    He didn't make sure, and he paid the price (one lost XMas evening).

    I'm not necessarily saying that an admin should consult the manual every time, but I would have thought that, just before sending a command that should reconfigure the entire network (not something you do every day, I guess), it might be a good thing to double-check and be sure.

    Oh well, he's learned his lesson.

    1. tip pc Silver badge
      Holmes

      Re: "for VTP to work, the command had to be sent to each of the switches"

      "I'm not necessarily saying that an admin should consult the manual every time, but I would have thought that, just before sending a command that should reconfigure the entire network (not something you do every day, I guess), it might be a good thing to double-check and be sure."

      Sounds logical, but the manual is normally thousands of pages long and often written in obscure language, which means mistakes happen even when the manual has been consulted.

      The only way to be sure is to lab it: not with all 80 switches, but certainly with at least 5, and see if the command does "behave" as expected.

      VTP is something that sounds great but truly is not.

      It's like L2 domains with spanning tree everywhere: just do yourself and everyone a favour and migrate to routed links. Spanning tree is the work of the devil and will kick you hard when it gets large.

      1. Anonymous Coward

        Re: "for VTP to work, the command had to be sent to each of the switches"

        Failing to RTFM and causing downtime for a large environment (80 switches and the intention to move to Nexus 7Ks suggests it's likely north of 2000 end user ports) that you then have to fix was 1990s cowboy IT rather than late noughties/early twenty teens behaviour. Even StackOverflow/Serverfault would likely have had the 2-minute summary given the Nexus 7000 came out in 2008.

        If you don't have time to learn then get in touch with a friendly reseller or contractor that knows what they are doing.

        It will save your time for something considerably more valuable than donating it to the company as an early Christmas present.

        1. Robert Carnegie Silver badge

          Re: "for VTP to work, the command had to be sent to each of the switches"

          This is "Who, Me?" after all. A library of usually unintended, usually unattributed consequences. What you shouldn't have done.

      2. Zippy´s Sausage Factory
        Devil

        Re: "for VTP to work, the command had to be sent to each of the switches"

        Ah yes, I remember technical manuals like that.

        "Note: configuration of the switch may optionally cause reset of all ports. If you do not want this not to unhappen, please do not refrain from not excluding the omission of the --ignoredonotexcludeomitnoreset parameter (please note that --ignoredonotincludeomitnoreset is deprecated and will be ignored)"

    2. DS999 Silver badge
      Facepalm

      He had 80 of them

      Before they were installed he didn't do a mini lab test involving three of them?

      1. chivo243 Silver badge

        Re: He had 80 of them

        We called it POC or proof of concept, and just because it worked in the proof, doesn't mean it works in production. Walked more than a mile in those shoes!

        1. Anonymous Coward

          Re: He had 80 of them

          POC is soooo last century...

          You need to be agile!

  4. Anonymous Coward

    VTP often considered a security risk

    Most places I've been to, it's been disabled.

    At one site we did a network replacement and it was a mess. The story was that a previous admin based on site went on a training course and implemented VTP when he returned, caused a huge outage and was disposed off shortly after.

    The legend was that no one wanted to clean up his mess, so one switch replacement project later his mess remained. Discussions of whether we should just remove it now resulted in deep intakes of breath and mutterings of "more than my job's worth" etc.

    Taking the hint, I just replaced the hardware with equivalent config, the inference being that it worked before I did anything and it still worked once I left, so no controversy from me. The fact we knew a new outsourcer was incoming also reduced the desire to meddle.

    The reason VTP is considered a security risk is that a malicious person can impact all switch VLAN config from one place, or add/delete/modify VLANs etc. Also, a lab switch with a later VLAN DB version can amend the DBs of an established environment detrimentally. It can be secured but most people didn't bother.

    https://community.cisco.com/t5/switching/stay-away-from-vtp/m-p/2462239/highlight/true#M292198

    1. DJV Silver badge

      Re: "was disposed off shortly after"

      Was quicklime involved?

      1. Doctor Syntax Silver badge

        Re: "was disposed off shortly after"

        There was one stairwell nobody wanted to use after the cleaners gave notice.

  5. Anonymous Anti-ANC South African Coward Silver badge
    Devil

    Well done on spinning Nakatomi in.

    "I now have root access. Ho Ho Ho".

    1. FrogsAndChips Silver badge
      Thumb Up

      and well done to the Regomizer for coming up with 'Roy' instead of more obvious names like 'Bruce'.

      1. Marty McFly Silver badge
        Pint

        "I really liked those sequin shirts."

    2. Benegesserict Cumbersomberbatch Silver badge
      Boffin

      Here was me thinking Roy upgrading Nexus models was going to be a segue into a Blade Runner reference.

      I've seen things you people wouldn't believe...

      (Icon looks like Eldon Tyrell if you hold it up like this)

      1. FrogsAndChips Silver badge

        You mustn't have paid attention to the article title, then.

        1. Anonymous Coward

          You mean article titles are informative, they aren't just there to lure us in before the old bait and switch?

          That doesn't sound like any Internet I've come across!

  6. FrogsAndChips Silver badge
  7. mobailey

    Yippee ki-yay, motherboard.

  8. Ball boy Silver badge

    Not checking is suicidal

    Back in the days of NT4 and Lotus Notes, I was in a hardware distributor whose MD thought it would be a good idea to actually use one of the RAID boxes we sold as our data volume - you know, so we could proudly show customers one of these damn things actually doing something. Lo, it came to pass and we ended up with a set of three SCSI disks humming away in an external 4-slot RAID box. The OS and RAID monitoring was on the internal boot disk and all worked nicely for a while... until someone got an alert to say drive #2 had gone down. A junior IT bod shut the box down and pulled the drive from the slot labeled '2', replacing it with a fresh one. They even remembered to set the SCSI ID to '2' with the jumpers on the back of the drive.

    Those that know SCSI busses - or anything much about computer systems, really - will recall that IDs tend to start at '0': what the RAID cabinet called slot '1' was, in fact, SCSI ID 0 and so on. One can only imagine the chaos that he was met with when the system was rebooted: one dead drive still in the array, one good one pulled out and replaced with a virgin unit - but with an ID conflicting with the only remaining good one... oops.

    We only knew about it because when we came in the following morning, there he still was, unshaven and looking very much like a deer caught in the headlights, but he had just about managed to get the system stable by 9am. Those who didn't know the reason for the all-nighter called him Zorro for being such a hero. We nicknamed him Zero because never was a name so justly deserved!
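
    A minimal sketch of the off-by-one described above, assuming (as the comment implies) that the cabinet labels its slots from 1 while SCSI IDs start at 0, and that the alert's "drive #2" referred to a SCSI ID rather than a slot label. The function names are made up for illustration.

```python
def slot_label_to_scsi_id(slot_label: int) -> int:
    """Convert a 1-based cabinet slot label to a 0-based SCSI ID."""
    if slot_label < 1:
        raise ValueError("cabinet slot labels start at 1")
    return slot_label - 1


def scsi_id_to_slot_label(scsi_id: int) -> int:
    """Convert a 0-based SCSI ID to the 1-based slot label on the cabinet."""
    if scsi_id < 0:
        raise ValueError("SCSI IDs start at 0")
    return scsi_id + 1


# Assumption: the alert's "drive #2" is a SCSI ID, not a slot label.
failed_scsi_id = 2
print(f"SCSI ID {failed_scsi_id} sits in the slot labelled "
      f"{scsi_id_to_slot_label(failed_scsi_id)}")
print(f"The slot labelled 2 holds SCSI ID {slot_label_to_scsi_id(2)}")
```

    Under those assumptions, translating the number in the alert before pulling anything would have pointed at a different slot from the one labelled '2'.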

    1. I could be a dog really Silver badge

      Re: Not checking is suicidal

      Ah, the joys of "misleading" labelling - or none at all.

      Many years ago I got called to a client as their Apple server had died. When I got there, the console had messages about the raid being dead (or something - it was a long time ago now).

      On enquiry, it seems someone needed to plug in a USB device, so had looked behind one of the "doors" for the front panel USB sockets. When he didn't find it, he looked behind the second "door" - at which point the server stopped. What he hadn't realised, until he'd killed the server, was that these "doors" were actually the drive caddies, with a "press to pop out the handle" feature which also shut down the drive. As he could remember what order he'd shut down the drives, I forced the 2nd one back into the array (thus giving the controller 2 of the 3 drives), and told the controller to rebuild the 1st - and all was well. And no, these servers didn't (IIRC) have any front mounted USB ports.

    2. Hans 1
      FAIL

      Re: Not checking is suicidal

      xkcd

    3. John Brown (no body) Silver badge

      Re: Not checking is suicidal

      "Those that know SCSI busses - or anything much about computer systems, really - will recall that ID's tend to start at '0': What the RAID cabinet called slot '1' was, in fact, SCSI ID 0 and so on. One can only imagine the chaos that he was met with when the system was rebooted: one dead drive still in the array, one good one pulled out and replaced with a virgin unit - but with an ID conflicting with the only remaining good one...oops."

      Been there, seen that. We were hired as a "pair of hands" to go replace a faulty HDD in a RAID on a customer site. Out of hours. Got there, security guard, surprisingly, was expecting me and showed me to the server (yes, singular. It was a while ago), only for me to note and point out to the security guard that a drive had already been pulled out. The wrong drive. So I phoned the customer's contact number and (also surprisingly) got straight through to their Sysadmin at HQ, and told him what I'd found. After the muttered "oh shit" comment, he kindly accepted that it was their own stupid fault, no one should have touched it, would I please re-install the incorrectly removed drive, replace the actually faulty one and then leave site since even that was more than we had been contracted for and he'd deal with the fallout and try to rebuild remotely. I made the security guard watch what I was doing, explaining in simple terms exactly what I was doing, wrote it up in the same way and got him to sign it. Extreme arse covering was most definitely the order of the day :-)

  9. Anonymous Coward

    Cisco survival

    With Cisco it's always best to use the following steps:

    1) 'rel in 10'

    2) do the needful (set the reload time as appropriate for your reconfiguration)

    3) test the results. If stuff works, 'rel can' then 'wri mem' and go home happy. If not, nervously watch the clock while you wait for the reboot (optionally grab the BOFH excuse calendar and play defense on the phone).

    Learned that after borking a router that was a 4 hour ride away.
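
    The steps above amount to a dead-man's switch: schedule a reload, change only the running config, and save only once the change is proven. Here is a toy Python model of that logic, not real device code. The `Device` class and its method names are hypothetical stand-ins for 'rel in 10' (`reload in 10`), 'rel can' (`reload cancel`) and 'wri mem' (`write memory`), and time is just a counter in minutes.

```python
class Device:
    """Toy model of a router with a startup config and a running config."""

    def __init__(self, startup_config):
        self.startup_config = dict(startup_config)
        self.running_config = dict(startup_config)
        self.reload_at = None  # minute at which a scheduled reload fires

    def reload_in(self, now, minutes):
        """Stand-in for 'reload in <minutes>'."""
        self.reload_at = now + minutes

    def reload_cancel(self):
        """Stand-in for 'reload cancel'."""
        self.reload_at = None

    def write_memory(self):
        """Stand-in for 'write memory': running config becomes startup config."""
        self.startup_config = dict(self.running_config)

    def tick(self, now):
        """Fire the scheduled reload if it is due; return True if it fired."""
        if self.reload_at is not None and now >= self.reload_at:
            # A reboot discards unsaved changes and boots the saved config.
            self.running_config = dict(self.startup_config)
            self.reload_at = None
            return True
        return False


router = Device({"gateway": "10.0.0.1"})
router.reload_in(now=0, minutes=10)              # step 1
router.running_config["gateway"] = "10.0.0.254"  # step 2: the change that locks you out
rebooted = router.tick(now=10)                   # step 3, unhappy path: wait for the reboot
print(rebooted, router.running_config)
```

    The model also shows why saving too early defeats the scheme: if `write_memory()` is called before the test in step 3, the reload simply boots the broken config.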

    1. FirstTangoInParis Bronze badge

      Re: Cisco survival

      While rolling out a three cornered demo network involving large satcom hops, I messed up on a config and borked the link to the far end. Fortunately the multi service switches I used automatically rolled back after 20 minutes if you didn’t log back in and confirm the change. I went for a cup of tea, still with crossed fingers, to await the link coming back up again.

      1. tip pc Silver badge

        Re: Cisco survival

        Sounds like commit confirmed in Juniper land.

        Does a rollback without restarting.

        I’ve never understood why Cisco never offered such an option.

    2. BILL_ME

      Re: Cisco survival

      Another good tip is to back-up the running config to flash under another name like bu-date, and a copy to your local machine.

      Only problem with the 'rel in 10' is the spastic auto-muscle-memory I have in doing 'wri mem' after x number of entries or cut and pastes.

      "With Cisco it's always best to use the following steps:

      1) 'rel in 10'

      2) do the needful (set the reload time as appropriate for your reconfiguration)

      3) test the results. If stuff works, 'rel can' then 'wri mem' and go home happy. If not, nervously watch the clock while you wait for the reboot (optionally grab the BOFH excuse calendar and play defense on the phone).

      Learned that after borking a router that was a 4 hour ride away."

    3. Mayday
      Alert

      Re: Cisco survival

      Reload in 10 won’t save a VTP fuckup.

      VLAN database stuff is independent of the running/startup configs.
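
      To illustrate the point, here is a toy Python model, assuming the VLAN database is kept in its own file on flash (`vlan.dat` on switches running VTP in server or client mode) rather than in the startup config. The `SwitchState` class is hypothetical and only models which piece of state a reload restores.

```python
from dataclasses import dataclass, field


@dataclass
class SwitchState:
    """Toy model: two configs plus a separately stored VLAN database."""
    startup_config: dict
    running_config: dict = field(default_factory=dict)
    vlan_dat: set = field(default_factory=set)  # lives outside both configs

    def reload(self):
        # A reboot re-reads the saved config, but the VLAN database
        # on flash is whatever it was when the switch went down.
        self.running_config = dict(self.startup_config)


sw = SwitchState(
    startup_config={"hostname": "core1"},
    running_config={"hostname": "core1"},
    vlan_dat={1, 10, 20},
)
sw.running_config["hostname"] = "oops"  # unsaved change, undone by a reload
sw.vlan_dat = {1}                       # VLAN database wiped via VTP
sw.reload()
print(sw.running_config, sw.vlan_dat)
```

      In this model the reload brings the hostname back but the VLAN database stays wiped, which is the failure mode the 'rel in 10' trick cannot cover.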

  10. G.Y.

    Intel once "revamped" the network during the PST night -- leaving the network utterly broken in Israel, midway through the workday. I found out the guy's home #, and he got a 2AM 'phone call.

    We were told never to do it again; I expressed a pious wish he should get his balls "revamped" off.

  11. BILL_ME

    So, the difference between an 'admin' and a basic networking guy.

    All the comments read so far, and none bothered to mention the basic fault.

    Easily since the late 2000s, most decent/large companies were not using VTP Mode Server to manage VLAN synchronization, but were using VTP Mode Transparent.

    Transparent disables VLAN synchronization and central management in favor of manual creation of VLANs on each switch.

    The primary reason is that installing a new switch with a higher revision number, and also set to Server by default, will quite often overwrite the rest of the connected switches, usually leaving all the switches with VLAN 1 as the only VLAN, with all the hilarity that induces.

    This is sort of Level 101 of networking you learn on Day 1-2, and why Transparent is usually used once you get above a small business size.
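
    A rough Python simulation of the overwrite described above, as a sketch rather than a faithful protocol model. It assumes all switches share one VTP domain, ignores passwords and VTP versions, and reduces the rule to "a server or client adopts any advertisement with a higher revision number than its own". `VtpSwitch` and `advertise` are made-up names.

```python
from dataclasses import dataclass, field


@dataclass
class VtpSwitch:
    name: str
    mode: str  # "server", "client" or "transparent"
    revision: int = 0
    vlans: set = field(default_factory=lambda: {1})

    def receive(self, adv_revision, adv_vlans):
        """Apply an advertisement; return True if the VLAN list was replaced."""
        if self.mode == "transparent":
            return False  # transparent switches keep their local VLANs
        if adv_revision > self.revision:
            self.revision = adv_revision
            self.vlans = set(adv_vlans)
            return True
        return False


def advertise(sender, others):
    """Send the sender's VLAN database to every other switch."""
    for sw in others:
        sw.receive(sender.revision, sender.vlans)


production = [
    VtpSwitch("core1", "server", revision=5, vlans={1, 10, 20}),
    VtpSwitch("access1", "client", revision=5, vlans={1, 10, 20}),
    VtpSwitch("access2", "transparent", vlans={1, 10, 20}),
]
# A lab switch that has seen many edits, then had its VLANs deleted.
lab = VtpSwitch("lab", "server", revision=42, vlans={1})

advertise(lab, production)
for sw in production:
    print(sw.name, sw.mode, sorted(sw.vlans))
```

    In this sketch only the transparent switch keeps VLANs 10 and 20; the server and client both fall back to VLAN 1 alone.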

    1. Anonymous Coward

      Re: So, the difference between an 'admin' and a basic networking guy.

      This.

      The number of System/Windows/Linux/DB Admins who think networking is 'just plugging the cable in' is staggering.

      A proper network admin would have understood the pitfalls associated with using VTP and would have done sufficient research and lab testing to know how to deal with the upgrade correctly.

      I recently had to run some testing myself after realizing that a planned VTP change may not have achieved what we wanted. Glad I did, too: it allowed me to come up with alternative solutions when all the lab tests failed.

      If you are lucky enough to have a real network admin in your group, include them in planning and troubleshooting. I have been able to resolve problems at the network level that system admins have spent months struggling with. That doesn't include all the times I was able to assist the admins in locating the root cause of a fault, either through indirect monitoring (network monitoring systems showing when an event started) or just by helping them understand how the network flow should work, so pinpointing the problem.

  12. Flat Phillip
    Mushroom

    VTP

    As soon as I read VTP I knew they were in for a bad night.

    I'm not saying it always goes bad but touching VTP will generally burn you, especially across switch families or with older IOS.

  13. jollyboyspecial

    To summarize Roy didn't plan properly and performance was piss poor? Pillock!

  14. Mikerahl

    My fun nightmare involved an OS upgrade, Cisco Nexus 7710 and the "alpha version" (they didn't call it that but I sure as heck did) of Overlay Transport Virtualization, combined with relatively new Cisco firewalls on a newer OS (read: bugs not yet known) in cluster mode, which was known to have "interesting problems". Said firewalls and switches ran their traffic over PIM between 2 data centers in 2 geographic locations. OTV allowed the customer to span layer 2 over layer 3 so they wouldn't have to readdress servers if they moved between data centers. Great idea in theory.

    Main issues came up when, for example, the "bit flipped" on the firewall without warning, so that the firewall in the other data center suddenly became the master and voice traffic went from a 2 hop journey to the voice servers to 30 hops. Surprisingly, voice doesn't work well when you suddenly introduce 28 extra hops and 150-200ms of additional latency without warning, and the client had many other time sensitive applications. It took us a year to stabilize the environment after Cisco finally released a patch.

    Then we needed to do a code update to patch vulnerabilities on the Nexus switches. The code upgrade took 4 hours, and it took 30 extra hours to find out why one of the VDCs was reporting that traffic was flowing through assigned interfaces when, in fact, it was not (sniffers proved that, and they were loads of fun to connect, given all the ports were fiber modules). Overall, my 70 hour software upgrade weekend, coupled with the often 90 hour weeks for a year for each person on the team, correcting all the problems, was such a fun experience.
