Not so much a "Who, me?" as a "Who, you?"
It's a new week and that means a fresh tale of close calls and proper falls from those who should know better in another of The Register's regular Who, Me? columns. This week's confessional comes courtesy of "Steve", and takes us back to the 1990s and a well-known UK bank. The office, he recalled, was an open-plan space the …
When I worked for Xerox we had two DHCP servers serving the same machines. Whichever one got there first served the required data. Seemed to work but this was back in the days when Xerox had a class A network (13.*.*.*) so no one was conserving IP addresses.
Ever have someone somehow shove a lightning cable in a USB port?
Not fun to fix that forced reassignment.
There's too many goes-into ports around to even call it phone gender.
Mine's the one with the DE-9 gender benders and DE-9:DB-25 size converters in the pocket.
"Labels people, and read them!"
Presumably everything in the Dev comms room was almost fair game, whilst the live room had stricter change control?
I also assume the Dev & Live comms rooms were appropriately signed.
Lastly, a big organisation should have had more than one DHCP server. PCs typically retain their DHCP addresses, only checking for a new address at half the lease time. For that reason we used to set DHCP leases at 4 days, so machines could still work on a Monday from Friday's lease, giving time to fix any issues from a weekend fault.
You can do it on the same subnet in an SME with something like pfSense doing your DHCP. You run two in parallel. Then use CARP to failover when things go haywire.
That's how I've done it. But there's a load of ways to do it. Search for DHCP failover.
p.s. Icon for BSD! --->
It's generally preferable to do application-level clustering (as done with ISC's dhcpd server). But yes, having two is generally a good idea, especially if you insist on doing it with (now) older versions of Windows.
Mind you, I would say that as that's what I've been doing for over a decade.
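For reference, ISC dhcpd's application-level failover is configured with a peer declaration; a minimal sketch, in which the peer name, addresses and split value are all illustrative:

```conf
# dhcpd.conf on the primary (the secondary mirrors this,
# with "secondary;" and the addresses swapped)
failover peer "dhcp-ha" {
  primary;
  address 192.168.1.2;
  port 647;
  peer address 192.168.1.3;
  peer port 647;
  max-response-delay 60;
  max-unacked-updates 10;
  mclt 3600;
  split 128;                    # primary serves half the hash space
  load balance max seconds 3;
}

subnet 192.168.1.0 netmask 255.255.255.0 {
  pool {
    failover peer "dhcp-ha";
    range 192.168.1.100 192.168.1.200;
  }
}
```

Both servers share one lease database view, so there's a single pool to manage rather than two half-scopes.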
If the DHCP server (windows, 'nix, etc.) doesn't support HA, you split the damn scope, like the best practices state. It's not that hard.
FWIW, Server 2012 R2's DHCP server does support HA, although it takes a bit of effort to set it up.
And also. LABEL THY PRODUCTION SERVERS. Even if it means you print a label out of paper and cellotape it to the front of the server.
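"Splitting the scope" just means two independent servers handing out non-overlapping ranges from the same subnet, often an 80/20 split; a sketch in ISC dhcpd terms (addresses illustrative):

```conf
# Server A dhcpd.conf -- serves roughly 80% of the pool
subnet 10.0.0.0 netmask 255.255.255.0 {
  option routers 10.0.0.1;
  range 10.0.0.50 10.0.0.199;
}

# Server B dhcpd.conf -- serves the remaining 20%
subnet 10.0.0.0 netmask 255.255.255.0 {
  option routers 10.0.0.1;
  range 10.0.0.200 10.0.0.240;
}
```

Because the ranges don't overlap, both servers can answer broadcasts without ever handing out the same address twice.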
> you split the damn scope, like the best practices state.
Show me the best-practice document that curses. And have you never had the situation where a client changes its IP every minute because it constantly jumped between the two DHCP servers, which were active at the same time but serving different parts of the same scope (as you describe, BTW)?
And how do you prevent two from being active at the same time with NT 4.0? Oh, yes, scripting!
And of course I know that 2012 R2 supports HA DHCP (without clustering); we've been using it ever since it existed.
And screaming about labelling AT ME IN YOUR REPLY TO MY COMMENT WITHOUT KNOWING WHO YOU ARE TALKING TO: you haven't seen my labelling style. Labelling servers is not enough. Label LAN cables, label electric cords (server name and the UPS it runs to, on both ends), label holes on a patch panel with a special function (a VLAN for development, for example)...
Honestly: if you reply, reply only to what someone wrote, and don't assume a whole lot of other nonsense that isn't written there out of false prejudice.
> And you never had the situation where a client changes its IP every minute because it constantly jumped between the two DHCP servers which were active at the same time but serving different parts of the same scope
They won't unless something is seriously wrong. A standards-compliant DHCP client will NOT switch servers (and hence IP pool) unless its "home" server is offline. As the lease runs down, the client will unicast a renewal request to the server from which it got its lease; the other server will not get a look-in, as it won't even see the packet. If your lease times are reasonably long (IIRC, Windows defaults to something like 8 days) then most clients will simply renew their lease at boot-up in the morning and then do nothing during the rest of the day. Some "not very sticky" clients may switch pool at this point, by simply broadcasting for any lease rather than requesting their previous lease.
Windows clients go further, and are very very sticky about their leases - which is itself a PITA at times.
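The renewal behaviour described above follows the RFC 2131 default timers, which can be sketched as:

```python
# RFC 2131 default renewal timers, in seconds.
# At T1 the client unicasts a renewal to its "home" server;
# only if that stays silent does it broadcast to any server at T2.

def dhcp_timers(lease_seconds: int) -> tuple[float, float]:
    """Return the default (T1, T2) times for a given lease length."""
    t1 = 0.5 * lease_seconds     # RENEWING: unicast to the leasing server
    t2 = 0.875 * lease_seconds   # REBINDING: broadcast to any server
    return t1, t2

# With Windows' roughly 8-day default lease, a client won't even
# consider another server until nearly day 7.
t1, t2 = dhcp_timers(8 * 24 * 3600)
print(f"T1 = {t1 / 86400:.1f} days, T2 = {t2 / 86400:.1f} days")
```

which prints "T1 = 4.0 days, T2 = 7.0 days": plenty of headroom before a second server on the subnet ever hears from the client.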
"LABEL THY PRODUCTION SERVERS. Even if it means you print a label out of paper and cellotape it to the front of the server."
THIS is why I insist on labellers being in the server room with supplies of appropriate tape.
If the labelmaking kit is right there, there's no excuse not to do it (Brother TZ labels work out at about 5p each, so not exactly bank-breaking.)
There are no issues with multiple DHCP servers either on the same subnet or serving the same subnet (via multiple DHCP helper addresses).
BUT (and I suspect this was your point) they MUST not serve overlapping address pools to avoid duplicate IP address assignments.
And for configuration management purposes, application-level clustering for your DHCP service, rather than multiple independent servers, is less likely to result in configuration SNAFUs.
>>A big organisation should have had more than 1 DHCP server
> But not on the same subnet, otherwise arguments happen.
Why would you be assigning random IPs to MACs that just happen to appear on your network? Security FAIL right there.
If you have IPs assigned in dhcpd.conf then everything works, and rogue boxes don't get an IP address.
Of course if you have 802.1x, they don't even get onto the vlan in the first place
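With ISC dhcpd, pinning IPs to known MACs looks roughly like this (host name, MAC and addresses are made up for illustration):

```conf
# dhcpd.conf: only declared hosts get a (fixed) address.
host alices-laptop {
  hardware ethernet 00:11:22:33:44:55;
  fixed-address 10.0.0.10;
}

# With no "range" declared for the subnet, unknown MACs
# get nothing at all:
subnet 10.0.0.0 netmask 255.255.255.0 {
  option routers 10.0.0.1;
}
```

It isn't real access control (MACs can be spoofed), hence the 802.1x point above, but it does stop casual rogue boxes.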
While reading I suspected something different, but that must be me being biased by prior experience of less well-managed environments: a newfangled machine starting its life in development or test and then miraculously being repurposed to production, obviously without ever being re-staged/moved/re-labelled as such.
Yes, I fscked this up myself on a small scale, not worthy of such a column. And saw it from a safe distance on a larger scale, too.
Well, yes and no. In the office where I was working on the Y2K thingy, the IT and comms room was properly bolted down, requiring a keycard and PIN for access, and no one there (we were all analysts and developers) had access.
One day some higher up realized that, just to change the daily backup tape, an IT minion had to do a daily trip from the main offices across town for that 1-minute job.
I was then chosen to do the job, so got access to the forbidden room. But I guess my bosses gathered I would never even think about taking any Cisco switch for a walk...
Had a similar thing happen to me in the early 2000s. One of the teams in our building was moved out to head office, about 100 miles away (our office was basically the IT hub).
But they still had gear in the single server room we had, which had a daily backup tape that needed rotating out. I got the job for a short while.
Since you were the only "local" with access, they could be pretty sure you wouldn't touch anything (assuming you had been around the business long enough to know that you'd get the blame for any problem remotely connected to the room).
Not quite the same, but I worked in an environment where some control hardware was being developed alongside the monitoring software.
Failing to get a response from the test unit usually resulted in a walk down to the engineering workshop to see if the device was actually powered up and connected to the network.
Usually it was simply switched off, but sometimes you would find the safety cage open with bits of the test rig lying on a bench.
If there were one or more boffins prodding it, tutting, and shaking their heads, it was time for an extended coffee break.
>>Not quite the same...
Instead of missing DHCP, I've experienced conflicts on more than one occasion when new hardware appears. I especially remember one that was my fault.
I had put a spare HP switch in my workspace so I could communicate with extra devices. Calls came in from users having logon problems the next day. I finally discovered that this model of HP switch included a DHCP server, which had been enabled. I had used the same switch without incident for years before. It didn't occur to me to check switch configurations back then. Since then: check everything first. You just never know who might have used gear before you, or what for...
I've had users do this at multiple companies. I also found that if you have an assistant, it's better all around to have your assistant disable the port (and the 2nd port they move their DHCP server to, and the 3rd port; it was a loong hallway) while you walk down the hall to stop them from putting it back on the network on yet another port.
> [users firing up dhcp servers] I also found that if you have an assistant, its better all around if you have your assistant disable the port (and the 2nd port they move their dhcp server to, and the 3rd port
I found it's better to have a switch smart enough to notice unauthorised DHCP servers and shut the port down all by itself. Also smart enough to notice (l)users who've decided to manually configure their system - with the IP address of the network gateway - and do the same thing, only in that case it STAYS disabled until we "have a chat" with them. Ditto if a port suddenly starts seeing multiple MACs (unauthorised switch, DoS attack, rogue VMs and in one case unauthorised wireless bridging)
Dickheads who wander around a lab causing every port to get shut down for security violations tend to find themselves rather unpopular with everybody else trying to work.
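The switch feature being described is generally called DHCP snooping; in Cisco IOS terms it looks roughly like this (VLAN and interface numbers illustrative):

```conf
ip dhcp snooping
ip dhcp snooping vlan 10
!
interface GigabitEthernet0/1
 description Uplink to the real DHCP server
 ip dhcp snooping trust
!
! All other ports are untrusted by default: DHCP server replies
! (OFFER/ACK) arriving on them are simply dropped, and rate
! limiting can errdisable a misbehaving port entirely.
```

Other vendors have equivalents under the same or similar names; the common idea is that only the trusted uplink may carry server-side DHCP traffic.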
You need a server running Zabbix (my choice) or some other monitoring software, e.g. Nagios. When someone unplugs a monitored server, within a few minutes you've got alerts, XMPP messages coming through, whatever, and you can fix it before anyone notices, most of the time. It also tells you when the disk is getting full, when it's getting too hot, when the RAID is degraded, etc. etc.
Just the other day, I found a server running cryptocurrency mining software in a user account because the CPU was constantly at 85C...
"You need a server running Zabbix "
"Just the other day, I found a server running cryptocurrency mining software in a user account"
It doesn't matter what monitoring software you have; there's a bunch of other stuff you need to be doing to make sure that miscreants aren't running malicious code on your systems. What other stuff have you not found?
Monitoring is a good (practically essential) start though, especially when your environment is too big for any one person to know what 'should' be going on with every device.
I've not spotted crypto-mining yet, but I've spotted servers filling their discs, which turned out to be something writing debug-level logs because the developer forgot to switch them off.
Another vote for Zabbix though. It can monitor practically anything with a network connection, and it's configurable seven ways from Sunday.
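Even without a full monitoring stack, the disc-filling case above is the sort of thing a few lines run from cron can catch; a minimal sketch (the threshold and the alert action are illustrative, a real setup would send mail or an XMPP message):

```python
import shutil

def check_disk(path: str, max_used_fraction: float = 0.9) -> bool:
    """Return True if usage on `path` is below the threshold."""
    usage = shutil.disk_usage(path)
    used_fraction = (usage.total - usage.free) / usage.total
    return used_fraction < max_used_fraction

# In a real deployment this would fire an alert rather than print.
if not check_disk("/"):
    print("ALERT: / is over 90% full")
```

A dedicated tool obviously does far more (trends, escalation, dead-host detection), but the point stands that even crude automated checks beat waiting for users to notice.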
I used to work at a tiny firm that did, amongst other things, live video streaming. The chap who was mostly responsible for the video servers acted as though security was against his religion.
We found on at least one occasion football matches being pushed to and streamed from those servers.
I'm glad I don't work with that guy anymore. Moron.
Had a DHCP issue this morning myself.
For some reason ClearOS's DHCP function decided to disable itself after a reboot following a power failure. Enabled it after users bleated about loss of wifi and network access....
Now all is well.
Strange how such a small thing can cause big issues...
Some time ago, 1999 I think, where I worked we had what was called the Customer Data Interchange server, or CDI. It was basically a small integration platform, running on AIX, managing incoming data from customers using UUCP and Kermit, mostly over dial-up at the time. We also had a few leased-line connections with larger clients, although no Internet connections back then (they arrived in 2001).
Peak times were late afternoons during the week, but we did have a little bit of data over the weekends from some of our larger clients. One of these larger clients had tried using the service one Sunday, around lunchtime, to no avail.
I was the lucky one providing call-out that weekend; we had a shared laptop and a pager that were handed over every Tuesday to the next person on call. The pager went off that Sunday lunchtime. I called the Unix Ops team, who had paged me and who were in the office 24/7, and asked what was up. "Client X can't get any data through, can you have a look?" "Okay," I say.
I dialled in from home (we had a modem rack for remote terminal access), got onto our jump box, and then tried to access the CDI server; it timed out. I tried various network tools; no response to ping etc. The CDI server was an AIX box that basically just kept going, 24/7; I'd never once known it to actually crash or freeze up. So I'm thinking maybe a hardware issue, or a network problem.
So I called Unix ops again..
Me: "Hi, anything going on in the DC today?"
Ops: "Yes, there was someone scheduled in this morning to decommission some old unused gear. Why?"
Me: "Are they still there, and could they go check what was actually decommissioned, specifically anything related to CDI?"
Ops: "ok, I'll call you back in a few".
There was me thinking maybe someone had pulled out one too many network cables or something.
30 minutes later, the phone rings; it's Ops: "Hi, did you say CDI?".
Me: "Yes why?".
Ops: "Well the guy doing the work left the building about an hour ago after finishing the work, and isn't answering his pager (turned out later it was turned off). But we got one of the security guys (the only other people on site) to go have a look, and they found a box by the back door, near the skips, with a label on it, saying 'CDI'"
Me: !!!! "Ah, that could be an issue!"
Turns out, of course, the guy doing the decommission work had decommissioned one too many servers. Our CDI server was ancient, and was due for replacement the following year. It turned out everything else in that section of the DC (built in the early '80s, I believe) was being decommissioned, and he'd basically just removed the lot, including our active server.
Some panicked calls from Ops trying to get hold of someone else who could help. They did manage to get someone, who then had to travel to the DC, carry the box back in from outside (thankfully it hadn't been raining!), connect everything back up, and just hope for the best when the power button was pressed.
I got another page late afternoon, spoke to Ops, who said the server was up and running, and could I check please. I dialled-in again, had a look around, did some housekeeping, and sure enough, everything seemed to be working fine.
To the credit of whoever it was who went in that afternoon, despite not being involved in the removal, they went in, and managed to get everything hooked back up and working, and stayed on site while I checked things out. He apparently also put a big label on the front of the box to state that it was a Live service, and not to remove it without getting clearance from my team!
Needless to say, we did a lot of manual monitoring for the next few days to make sure everything was running fine, and I was also involved in a few lessons-learned meetings, which changed a few of the processes we had (or just created new ones, as they didn't exist yet!). That included, for example, requiring anyone doing any out-of-hours work at the DC to be available on call for the next 24 hours minimum. If they couldn't do the on-call cover, they weren't allowed to do the work.
~2010 I was on the phone with Cisco support regarding a 6512r we had recently installed and running in a contact centre. It was up and taking calls, but we had some issue, IIRC to do with ACLs consuming CPU instead of running in the ASIC. "sh tech" output and logs went back and forth to Cisco, then a lot of WebExes. The Cisco engineer suggested we move the code up to a recent version (released after we'd hit the issue), and we planned to install it on the redundant supervisor. At the start of the call I stated the switch was live, taking customer revenue-generating calls; during our troubleshooting I reiterated the same; before we started the upgrade on the redundant supervisor I repeated it, saying we would do the switch-over out of hours; once the upgrade was done I repeated it again, and that I'd do the switch-over overnight. She then flipped the supervisors, causing both supervisors to boot, all calls to drop and the phones to power off (PoE). I was actually on site; a contact centre going quiet is just as eerie as a server room going quiet. Of course I lost the WebEx when the switch went too. Luckily everything came up OK, though we still had the original issue despite the new code. I sent some really snotty emails to Cisco that day!! For some reason I didn't get into trouble for that one!
The issue was too many operands used across the various ACLs, forcing the CPU to process them instead of the ASICs.
few have seen their boss nonchalantly strolling through the office, critical piece of infrastructure under their blissfully ignorant arm.
Not quite that, but had a boss who occasionally had a fit of OCD. On one such occasion, he decided to clear out "unused" ethernet and power cables in the racks as they were untidy... He managed to disconnect or unplug two DCs, an ESXi Host with about 20 VMs on it, and the PRTG monitoring server before we could stop him.
On the mainframe, we had a new, keen data manager join my team. He had just been promoted from assistant, to full data manager. After a week of looking at the environment, and listing what was used etc. he got to work, and deleted lots of files which had not been used in the last 5 years. He managed to recover about half of the disk space so was very pleased with his work - till Monday. On Monday lots of people complained because about 1000 CICS regions did not start, because lots of data sets had been deleted. The existence of the file was checked at startup - they were only used when there was a problem.
Fortunately these were test systems, but it took several days to recreate the files.
Two things he learned from this were to check before making changes, and to set up rules to categorise data sets into "backup" or not-backup.
Ah. The Boss.
One place I joined, the DPM would take each new starter (including myself) and give them The Talk. One part of The Talk was to log into the '38, show the PWRDWNSYS command, point out that the default action is *CNTRLD (IBM-speak for "dies on its arse in a hideously and agonisingly drawn-out manner", as it disables each terminal, subsystem, controller, etc. as soon as they are unused), show that we always change this to *IMMED, and close with "But we really don't want to do that now" and CMD-3 to quit.
Years later, I was out on site setting up a '36 and ready to test linking it back home. I called a colleague:
"Ok, all ready here. Can we test it now?".
"Why the fuck not? I want to finish today and get away."
"The comms subsystem is down."
"Well bring it back up!"
"Somebody's down a PWRDWNSYS *CNTRLD."
"WHAT!!??!! What sort of monumental fucking idiot would do that??"
"New starter and the DPM hit the wrong key..."
Wasn't there an option to specify a restart as well (as opposed to just power off). ISTR the AS/400 had that (non-default) option. Omitting it was a pucker moment when the entire machine went silent. The fix is simply toggling up on the power switch, which was pretty stubborn on the old iron.
Neither happened to me, but I've got a couple of stories..
1st one happened to a friend where I work now. When he was a young technician, he went into work one weekend to tidy up some of the comms rooms. This was part of a larger project to ensure that all the comms rooms were tidy and had standardised colour-coding of cables. He came in first thing Saturday and enthusiastically pulled out all the red cables (red was already standardised, as those cables were installed by our switchboard provider and used to patch the phones to the switchboard), then realised he didn't have a clue which cable had been patched to which socket. This was important, as the switchboard assigned the extension number to the socket on the patch panel that the extension was patched to. He spent the next three days ringing every extension in the building from his mobile, finding which extension had actually rung, then patching that connection to the right socket. As part of that same project, I spent the hottest part of that summer stuck in a patch room that was barely large enough for me and the patch panel, had no ventilation, and, for some reason, I wasn't allowed to wedge the door open. I lost a lot of weight through sweat.
Thankfully, the current switchboard uses IP phones and assigns extension numbers to those IPs, so as long as the phone is patched to the right subnet, it works.
The 2nd one happened to a then online friend. He went into work to move a server. Powered down and packed it up properly. Then, when the complaints came in (and it hit this very publication), he realised what he'd done. He'd powered down and packed up one of the servers hosting NTL.com. Not sure why load balancing wasn't working, but his action took out the website.
The immediate response of any PFY/Boss/DevOps is usually "who me? no, mate"
Let me guess the bank. In that time the only 2 banks who used 'football pitch' to boast of their dealing room sizes were Deutsche Bank and NatWest, both then located at Broadgate.
Had an old DHCP machine removed once; so old it was not even rack-mounted, but sat happily in the middle of a rack on top of another server. To make it clear it was important, it was caged in with cut-down wire panels from an old Royal Mail cart, zip-tied to the rack frame, with signs reading "do not touch ☺".
Yes, it got removed one morning by someone removing/replacing the servers under it, and it was never put back. In the debrief, when asked why it got touched, the reply was "I saw the smiley face and thought it was a joke".
We needed to have a test server restarted in a data room in Cambridge while we were in London, but some genius had planted our server near a herd of others also on dev projects. Naturally, centralised procurement meant they were all 100% identical so getting someone to reset a machine was pretty much guaranteed to reboot the wrong machine and upset others - Murphy's law never sleeps.
We did have command line control, so we ran a quick line in bash to open and close the server's CD tray and left that running while directing a colleague to the right location. I believe machines now have various means to ID them (I recall HP having a blue LED that you could toggle), but this was quite a while back, illustrated by the fact that it was even a CD tray, not a DVD one.
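The tray trick is a loop along these lines, assuming Linux and the common `eject` utility (the device path is illustrative, and the original presumably ran until killed rather than for a fixed count):

```shell
# Cycle the CD tray so whoever is standing in the server room
# can spot the right box.
DEV=/dev/cdrom
for i in 1 2 3 4 5; do
    eject "$DEV" || break      # open the tray; stop if there's no drive
    sleep 2
    eject -t "$DEV" || break   # pull it back in
    sleep 2
done
```

Modern kit makes this unnecessary: most servers now have a locate/identify LED you can toggle remotely via the BMC.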
Anyway, sometimes you gotta do what you gotta do. In those days I once took a keyboard apart so I could have the PCB + PS/2 connector hanging off a Compaq box that we used as a Linux box. Compaq ran this, well, scam that you couldn't start a desktop without a keyboard attached so you had to buy their more expensive machines if you wanted to run something as a server. Butchering a cheap keyboard was cheaper (couldn't leave a full keyboard connected as it was a bit of a skunkworks project to start with which we stuck in one of the comms racks, hence also the need to keep it cheap).
Not that we needed a reboot once it was stable - we deliberately left it running at the end of the project, just to clock a full year uptime :).
> Not that we needed a reboot once it was stable - we deliberately left it running at the end of the project, just to clock a full year uptime :).
I had a linux box on a customer site (mail gateway/leasedline router before the days of DSL ) that clocked just over 3 years of uptime.....
A cleaner unplugged it in order to use her vacuum cleaner. Despite the DO NOT SWITCH OFF label on the socket and DO NOT DISCONNECT on the plug
Despite the DO NOT SWITCH OFF label on the socket and DO NOT DISCONNECT on the plug
That's just about the surest method I know to achieve the opposite. This could be a worthy attempt at an Ig Nobel prize: checking what would happen if you used a label reading "KEEP THIS OFF, absolutely DO NOT LEAVE CONNECTED". I suspect it would peel off from old age before anyone touched it.
Upvote for the keyboard hack! I had to do something similar years ago when setting up some display monitors, as I didn't want people to mess about with them. Had to bend one of the blanking plates so I could put the PCB inside the case and thread the PS/2 cable out to the socket. I was only allowed to use old Compaq PCs (well, it was for a webpage that updated once every 5 mins, so fine). There was also a setting in the BIOS so that they would switch on automatically after a power cut. Set up autologin, link to the webpage in the Startup folder, and I didn't have to touch them for weeks! Then I got asked to set up another, and couldn't remember how I had done it!
We had something like this happen at an organisation (UK gov) where I worked on their service desk. Saturday morning on call and started getting calls that no one could get onto the network via vpn. After a few enquiries it was found that a project team were moving the 3G concentrator from a location near Heathrow to HQ in Central London. 2 identical boxes! Yep they had disconnected the VPN concentrator! It was rushed back to Heathrow in a hurry before they closed up access to the location and I believe some labelling was applied
Yup. A "pitch", in American English, is when a baseball pitcher throws a ball. The American version of the phrase "football pitch" is "soccer field". Also, not all football pitches are the same size; Wikipedia says 45-90m by 90-120m, so potentially a 2.5x difference in surface area.
Biting the hand that feeds IT © 1998–2020