I can just see...
An office chair with a note saying "now I have a network analyser" taped to it
Welcome, gentle reader, to another instalment of Who, Me? in which we cushion your entry to the working week with tales of Reg readers having worse days than you. So kick off your shoes and socks, make fists with your toes, and read on. This week meet a reader we'll Regomize as "Roy" who was contracted to a very large …
So he was operating on faulty assumptions. I don't want to be harsh, but it seems that an admin should always check that what he's doing is the right thing, otherwise mayhem might ensue.
He didn't make sure, and he paid the price (one lost XMas evening).
I'm not necessarily saying that an admin should consult the manual every time, but I would have thought that, just before sending a command that should reconfigure the entire network (not something you do every day, I guess), it might be a good thing to double-check and be sure.
Oh well, he's learned his lesson.
"I'm not necessarily saying that an admin should consult the manual every time, but I would have thought that, just before sending a command that should reconfigure the entire network (not something you do every day, I guess), it might be a good thing to double-check and be sure."
Sounds logical, but the manual is normally thousands of pages long and often written in obscure language, which means mistakes happen even when the manual has been consulted.
The only way to be sure is to lab it - not with all 80 switches, but certainly with at least five - and see if the command really does "behave" as expected.
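For what it's worth, a quick sanity check when labbing this - assuming Cisco IOS, since that's what the story involves - is to note the VTP mode, domain and Configuration Revision on every test switch before plugging anything new in, then compare afterwards:

show vtp status
show vlan brief

If the Configuration Revision jumps or the VLAN list shrinks after you cable in the "new" switch, you've just previewed the production outage in miniature.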
VTP is something that sounds great but truly isn't.
It's like L2 domains with spanning tree everywhere: just do yourself and everyone a favour and migrate to routed links. Spanning tree is the work of the devil and will kick you hard when it gets large.
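For anyone tempted by the routed-links suggestion, a minimal IOS-style sketch of turning a trunk uplink into a routed point-to-point link (interface and addressing invented purely for illustration) looks something like this:

ip routing
!
interface GigabitEthernet1/0/24
 description uplink to distribution switch
 no switchport
 ip address 10.0.12.1 255.255.255.252
 no shutdown

Each uplink becomes its own /30 (or /31) with a routing protocol on top, so there is no spanning tree blocking anything and a loop can't melt the whole campus.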
Failing to RTFM and causing downtime for a large environment (80 switches and the intention to move to Nexus 7Ks suggests it's likely north of 2000 end user ports) that you then have to fix was 1990s cowboy IT rather than late noughties/early twenty teens behaviour. Even StackOverflow/Serverfault would likely have had the 2-minute summary given the Nexus 7000 came out in 2008.
If you don't have time to learn then get in touch with a friendly reseller or contractor that knows what they are doing.
It will save your time for something considerably more valuable than donating it to the company as an early Christmas present.
Ah yes, I remember technical manuals like that.
"Note: configuration of the switch may optionally cause reset of all ports. If you do not want this not to unhappen, please do not refrain from not excluding the omission of the --ignoredonotexcludeomitnoreset parameter (please note that --ignoredonotincludeomitnoreset is deprecated and will be ignored)"
Most places I've been to, it's been disabled.
At one site we did a network replacement and it was a mess. The story was that a previous admin based on site went on a training course and implemented VTP when he returned, caused a huge outage, and was disposed of shortly after.
The legend was that no one wanted to clean up his mess, so a switch replacement project later his mess remained. Discussions of "should we not just remove it now?" resulted in deep intakes of breath and mutterings of "more than my job's worth", etc.
Taking the hint, I just replaced the hardware with equivalent config, the inference being that it worked before I did anything and it still worked once I left, so no controversy from me. The fact that we knew a new outsourcer was incoming also reduced the desire to meddle.
The reason VTP is considered a security risk is that a malicious person can impact the VLAN config of every switch from one place, or add/delete/modify VLANs, etc. Also, a lab switch with a later VLAN database revision can detrimentally amend the databases of an established environment. It can be secured, but most people didn't bother.
https://community.cisco.com/t5/switching/stay-away-from-vtp/m-p/2462239/highlight/true#M292198
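For the record, the usual ways of locking VTP down (if you really must keep it) are a domain name plus password, or VTP version 3, where only a designated primary server may change the VLAN database. A rough IOS sketch, with the domain and password obviously made up:

vtp domain CORP-LAN
vtp password S0methingL0ng
vtp version 3
! then, on the one switch allowed to edit VLANs (an exec command, not config):
vtp primary vlan

None of which helps if nobody bothers, as the post above says.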
Back in the days of NT4 and Lotus Notes, I was in a hardware distributor whose MD thought it would be a good idea to actually use one of the RAID boxes we sold as our data volume - you know, so we could proudly show customers one of these damn things actually doing something. Lo, it came to pass and we ended up with a set of three SCSI disks humming away in an external 4-slot RAID box. The OS and RAID monitoring was on the internal boot disk and all worked nicely for a while... until someone got an alert to say drive #2 had gone down. A junior IT bod shut the box down and pulled the drive from the slot labelled '2', replacing it with a fresh one. They even remembered to set the SCSI ID to '2' with the jumpers on the back of the drive.
Those who know SCSI busses - or anything much about computer systems, really - will recall that IDs tend to start at '0': what the RAID cabinet called slot '1' was, in fact, SCSI ID 0, and so on. One can only imagine the chaos that he was met with when the system was rebooted: one dead drive still in the array, one good one pulled out and replaced with a virgin unit - but with an ID conflicting with the only remaining good one... oops.
We only knew about it because when we came in the following morning, there he still was, unshaven and looking very much like a deer caught in the headlights, but he had just about managed to get the system stable by 9am. Those who didn't know the reason for the all-nighter called him Zorro for being such a hero. We nicknamed him Zero, because never was a name so justly deserved!
Ah, the joys of "misleading" labelling - or none at all.
Many years ago I got called to a client because their Apple server had died. When I got there, the console had messages about the RAID being dead (or something - it was a long time ago now).
On enquiry, it seems someone had needed to plug in a USB device, so had looked behind one of the "doors" for the front-panel USB sockets. When he didn't find it, he looked behind the second "door" - at which point the server stopped. What he hadn't realised, until he'd killed the server, was that these "doors" were actually the drive caddies, with a "press to pop out the handle" feature which also shut down the drive. As he could remember what order he'd shut down the drives, I forced the second one back into the array (thus giving the controller two of the three drives), and told the controller to rebuild the first - and all was well. And no, these servers didn't (IIRC) have any front-mounted USB ports.
"Those that know SCSI busses - or anything much about computer systems, really - will recall that ID's tend to start at '0': What the RAID cabinet called slot '1' was, in fact, SCSI ID 0 and so on. One can only imagine the chaos that he was met with when the system was rebooted: one dead drive still in the array, one good one pulled out and replaced with a virgin unit - but with an ID conflicting with the only remaining good one...oops.
Been there, seen that. We were hired as a "pair of hands" to go and replace a faulty HDD in a RAID on a customer site. Out of hours. Got there; the security guard, surprisingly, was expecting me and showed me to the server (yes, singular - it was a while ago), only for me to note and point out to the security guard that a drive had already been pulled out. The wrong drive. So I phoned the customer's contact number and (also surprisingly) got straight through to their sysadmin at HQ and told him what I'd found. After the muttered "oh shit" comment, he kindly accepted that it was their own stupid fault, no one should have touched it, and would I please re-install the incorrectly removed drive, replace the actually faulty one and then leave site, since even that was more than we had been contracted for, and he'd deal with the fallout and try to rebuild remotely. I made the security guard watch what I was doing, explaining in simple terms exactly what I was doing, wrote it up in the same way and got him to sign it. Extreme arse covering was most definitely the order of the day :-)
With Cisco it's always best to use the following steps:
1) 'rel in 10'
2) do the needful (set the reload time as appropriate for your reconfiguration)
3) test the results. If stuff works, 'rel can' then 'wri mem' and go home happy. If not, nervously watch the clock while you wait for the reboot (optionally grab the BOFH excuse calendar and play defense on the phone).
Learned that after borking a router that was a 4 hour ride away.
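Spelled out, for anyone who hasn't used the trick, the sequence being described is roughly this (ten minutes is just an example; pick a window that suits the change):

reload in 10
configure terminal
 ! ...make the risky changes here...
end
! if you can still reach the box and everything works:
reload cancel
write memory

The whole point is that you don't save until you're happy: if the change locks you out, the scheduled reload boots the box back onto the last saved config and you're back where you started.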
While rolling out a three cornered demo network involving large satcom hops, I messed up on a config and borked the link to the far end. Fortunately the multi service switches I used automatically rolled back after 20 minutes if you didn’t log back in and confirm the change. I went for a cup of tea, still with crossed fingers, to await the link coming back up again.
Another good tip is to back up the running config to flash under another name like bu-date, and keep a copy on your local machine.
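Something along these lines, with the file name obviously just an example:

copy running-config flash:bu-2010-12-24
copy running-config tftp:

The second command prompts for the server address and destination file name, so you end up with one copy on the box and one safely off it.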
The only problem with 'rel in 10' is the reflexive muscle memory I have for doing 'wri mem' after x number of entries or cut-and-pastes.
"With Cisco it's always best to use the following steps:
1) 'rel in 10'
2) do the needful (set the reload time as appropriate for your reconfiguration)
3) test the results. If stuff works, 'rel can' then 'wri mem' and go home happy. If not, nervously watch the clock while you wait for the reboot (optionally grab the BOFH excuse calendar and play defense on the phone).
Learned that after borking a router that was a 4 hour ride away."
Intel once "revamped" the network during the PST night -- leaving the network utterly broken in Israel, midway through the workday. I found out the guy's home #, and he got a 2AM 'phone call.
We were told never to do it again; I expressed a pious wish he should get his balls "revamped" off.
All the comments read so far, and none bothered to mention the basic fault.
Easily since the late 2000s, most decent/large companies were not using VTP Mode Server to manage VLAN synchronization, but were using VTP Mode Transparent.
Transparent disables VLAN synchronization and central management in favor of manual creation of VLANs on each switch.
The primary reason is that installing a new switch with a higher configuration revision, and still defaulted to Server mode, will quite often overwrite the VLAN database of the rest of the connected switches - usually leaving every switch with VLAN 1 as the only VLAN, and the hilarity that induces.
This is the sort of Networking 101 you learn on Day 1-2, and why Transparent is usually used once you get above small-business size.
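In practice that just means putting every switch into transparent mode and defining the VLANs locally, something like this on IOS (VLAN numbers and names invented):

vtp mode transparent
vlan 10
 name USERS
vlan 20
 name SERVERS

and then checking with 'show vtp status' that the operating mode really does say Transparent before the box goes anywhere near production.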
This.
The number of System/Windows/Linux/DB Admins who think networking is 'just plugging the cable in' is staggering.
A proper network admin would have understood the pitfalls associated with using VTP and would have done sufficient research and lab testing to know how to deal with the upgrade correctly.
I recently had to run some testing myself after realizing that a planned VTP change might not achieve what we wanted. Glad I did, too - it allowed me to come up with alternative solutions when all the lab tests failed.
If you are lucky enough to have a real network admin in your group, include them in planning and troubleshooting. I have been able to resolve problems at the network level that system admins had spent months struggling with, and that doesn't include all the times I was able to assist the admins in locating the root cause of a fault, either through indirect monitoring (network monitoring systems showing when an event started) or just by helping them understand how the network flow should work and so pinpointing the problem.
My fun nightmare involved an OS upgrade, a Cisco Nexus 7710 and the "alpha version" (they didn't call it that, but I sure as heck did) of Overlay Transport Virtualization. Combine that with relatively new Cisco firewalls on a newer OS (read: bugs not yet known) in cluster mode, which was known to have "interesting problems". Said firewalls and switches ran their traffic over PIM between two data centers in two geographic locations. OTV allowed the customer to span layer 2 over layer 3 so they wouldn't have to readdress servers if they moved between data centers. Great idea in theory.
The main issues came up when, for example, the "bit flipped" on the firewall without warning, so that the firewall in the other data center suddenly became the master and voice traffic went from a 2-hop journey to the voice servers to a 30-hop one. Surprisingly, voice doesn't work well when you suddenly introduce 28 extra hops and 150-200ms of additional latency without warning, and the client had many other time-sensitive applications. It took us a year to stabilize the environment after Cisco finally released a patch.
Then we needed to do a code update to patch vulnerabilities on the Nexus switches. The code upgrade took 4 hours, and it took 30 extra hours to find out why one of the VDCs was reporting that traffic was flowing through the assigned interfaces when, in fact, it was not (sniffers proved that, and they were loads of fun to connect given all the ports were fiber modules). Overall, my 70-hour software upgrade weekend, coupled with the often 90-hour weeks for a year for each person on the team correcting all the problems, was such a fun experience.
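For context, OTV stretches chosen VLANs between sites across an IP core, and in the multicast flavour described above it leans on PIM for its control plane. A stripped-down NX-OS sketch of the sort of thing involved (interface, groups and VLAN range purely illustrative) is:

feature otv
otv site-vlan 99
otv site-identifier 0x1
!
interface Overlay1
  otv join-interface Ethernet1/1
  otv control-group 239.1.1.1
  otv data-group 232.1.1.0/28
  otv extend-vlan 100-150
  no shutdown

Which looks innocuous enough on the page; as the post above shows, the fun starts when the overlay, the firewall cluster and the underlying multicast all have opinions about where the traffic should go.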