
An interesting workplace, Dr. Falken ...

"If management have any sense, they will be persuadable that an approved outage during a predictable time window with the technical team standing by and watching like hawks is far better than an unexpected but entirely foreseeable outage when something breaks for real and the resilience turns out not to work."

Nope. Gotta have all systems up 24/7/365, you know. Can't look like laggards with scheduled downtime, now, can we?

Forget about routine downtime. We had to beg and plead for ad hoc scheduled maintenance windows. We tended to get them after a failure brought down the campus (and of course, we made good use of the failure downtime as well). But upper administrators' memories were even shorter than our budget, and it would happen again a few months later.

Thank $DEITY for the NTP team knowing what they were doing. It was easy to bring independent local NTP servers online ("Is it really this easy?? We must be doing something wrong"). We put in three or four, each synced independently to four or five NTP pool servers, but capable of keeping good time for several days if the internet crapped out. The sane NTP setup resulted in a noticeable drop in gremlins across our servers, particularly the LDAP "cluster".
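
For anyone wondering just how easy "this easy" was: each box needed only a handful of lines of config. A sketch from memory, assuming classic ntpd (the pool hostnames and stratum here are illustrative, not our actual setup):

    # /etc/ntp.conf -- minimal independent local time server (sketch; ntpd assumed)
    driftfile /var/lib/ntp/ntp.drift   # remember the clock's frequency error across restarts
    server 0.pool.ntp.org iburst       # four or five upstream pool servers,
    server 1.pool.ntp.org iburst       # picked independently on each box
    server 2.pool.ntp.org iburst
    server 3.pool.ntp.org iburst
    tos orphan 10                      # keep serving time if the internet craps out
    restrict default kod nomodify notrap nopeer noquery   # serve time, accept no changes

The driftfile is what lets a box free-run sanely for days once it has learned its oscillator's drift, and orphan mode keeps it answering clients while unsynced; pointing each server at a different handful of pool hosts meant there was no shared upstream to fail.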

That LDAP setup was a treat: three machines configured for failover. Supposedly. One had never been configured properly and was an OS and LDAP version behind the others, but the other two wouldn't work unless the first was up. Failover didn't work. It was a cluster in multiple senses of the word, and everyone who set it up had departed for greener pastures. We didn't dare try to fix it; it was safer to leave it alone and just reboot it when it failed. Actually, we wanted to fix it, but who has time for learning, planning, and executing a change amidst all the firefighting?
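
For flavor, a check along these lines exposes that kind of rot quickly. A sketch only: I'm assuming OpenLDAP with syncrepl replication here, and the hostnames and suffix are invented:

    # Compare replication state across the three "failover" nodes (OpenLDAP/syncrepl assumed).
    # Diverging contextCSN values mean the replicas have silently split.
    for h in ldap1 ldap2 ldap3; do
        echo "== $h =="
        ldapsearch -x -H "ldap://$h" -s base -b "dc=example,dc=edu" contextCSN
    done

When one box won't answer and the other two disagree, "failover" is a word, not a feature.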

<digressive rant>

Besides, fixing it wasn't really necessary, since the higher ups decided we were going to have a nice new nifty Active Directory service to replace it. Problem is, AD has a baked-in domain-naming convention ... and the name it wanted was already in use ... by the LDAP servers. We had to bring in a consulting service to design the changeover and help implement it. No problem, eh? Well, they were actually extremely competent and efficient but the mess that the previous IT staff had left was so snarled that the project was only three-quarters implemented when I left a year later. At least it had cloud-based redundancy, and failover seemed to work.

The reason for switching to AD? Officially, compatibility with authentication interfaces for external services (which, it turns out, could usually do straight LDAP too). Reading between the lines: it finally dawned on the previous team what a mess they'd made with LDAP and rather than redo it right they went after a new shiny. When they left there was an opportunity to kill the AD project, but a new reason arose, just before I came on board: the college president liked Outlook and the higher-ups decided that meant we had to use M$ back-end software.

</rant>

We also had dual independent AC units for the server room. Mis-specced. When one was operational it wasn't quite enough to cool the room. When the second kicked in it overcooled the room. If both ran too long it overcooled the AC equipment room as well, and both AC units iced up. Why would it cool the AC room? Why indeed. The machine room was in a sub-basement with no venting to the outside. The machine room vented into the AC equipment room, and that vented into the sub-basement hallway.

When the AC units froze up, cue a call to Maintenance to find out where they'd taken our mobile AC unit this time. Then the fun of wheeling it across campus, down a freight elevator that occasionally blew its fuse between floors, into the machine room, then attaching the jury-rigged ducting. It could have been worse. We had our main backup server and tape drive in a telecomms room in another location, and that place didn't have redundant AC. Its one AC unit regularly failed, and for security's sake whoever was on-call got to spend the night by the open door with a couple of big fans growling at the stars.

It was sheer luck that one of our team had been an HVAC tech in a previous life; he was able to at least minimize the problems, and to tell the Facilities staff what we really needed when the building was renovated.

Oh, do you want to hear about that whole-building floor-to-ceiling renovation? About the time a contractor used a disk grinder to cut through a pipe, including its asbestos cladding, shutting the whole building down for a month while it was cleaned up? With no (legal) access to the machine or AC room for much of that month? Another time, grasshopper.

<rant redux>

The college president commissioned an external review of the IT department to find out why we had so many outages, in preparation for firing the head of IT. The report came back dropping a 16-ton weight right on her for mismanagement. Politely worded but unmistakable. She tried to quash it, but everyone knew it was on her desk. She then tried to get the most damning parts rewritten, but the author wouldn't budge and eventually it all came out. Shortly afterward an all-IT-hands meeting was held where the president appeared (I was told she almost had to be dragged) and stated that we'd begin addressing the problems with band-aids, then move on to rubber bands. Band-aids. That was the exact word she used. I lasted another half year or so, but that was the clear beginning of the end.

The college is also my alma mater, and I have many fond memories of student days. But I don't respond to their donation pleas any more.

</rant>
