So the only administrator for a business critical system just took the day off during a major infrastructure change?
That sounds like terrible planning from the company to me.
Welcome back to On Call, where you get to take a breather and enjoy tales of tech support adversity from your peers. And taking a breather is exactly what this week's Reg reader, "Adrian", needed to do at the end of this particular incident. About two decades ago, Adrian was the network administrator for a medium-sized …
That's quite normal, the PHB probably thought that everything would go fine and the staff would fix any issues - no need to miss a nice day for golf. Turned out that way too - the PHB probably got a nice bonus at the end of the year for the successful migration too. Welcome to modern business management.
Unless something goes wrong that requires upper manglement's attention to fix, they never appreciate the complexity and difficulty of achieving things.
I've done several multisite (10+) WAN migrations in different jobs. I remember one was global and had the full attention of the business leaders, as it forced the multi-billion-$ business to go manual on its 24/7 operations whilst we cut over. Only a few hours of disruption, of which downtime was less than 5 minutes, but we had the board's full attention, blessing and thanks for completing such a complex task on time with minimal downtime and disruption.
Another one was a bit smaller (still multi-billion-£ turnover, 4k+ UK staff). No one in the business cared about the potential for disaster, and after a year of planning to have the whole weekend, from close of business ~7pm Friday to 6am Monday, to do the work, the organisation decided four days before the work started that they needed the network back on the Saturday afternoon. Started ~8pm Friday, finished ~midnight, and caused no issues, but found tons of odd stuff to work on later (stupid async routing across stupidly implemented dual links across stupid other stuff). We had next to no business testing though, only what I could test and what those who needed the live service on the Saturday worked on. No one reported any issues, as no one noticed any issues; barely got a thanks that time.
You never get thanks for things people don't appreciate.
If it works nobody cares and nobody pats you on the back.
If it doesn't work everyone cares and they are all ready to kick your ****
This is why I left IT Support. Doesn't matter why things don't work, if you are the nearest IT professional you get it in the neck regardless. I can't think of any other department where that happens.
"Although everyone remembers Y2K because nothing happened."
Several things happened in the leadup to it.
Most of the world's alarm systems were discovered to be set to malfunction on 9/9/99 (reported here at the time), as this was used as a "test date".
A lot of NTP-using switches and routers around the world went titsup earlier, in February 1999, when the number of seconds since Jan 1 1970 overflowed a signed long integer. This knocked out virtually the entire Chinese Internet until the cause was realised (the routers were rebooting, and would then reboot again when they tried to NTP sync; the fix was to disable NTP).
1/4 of New Zealand's phone system went completely titsup for 12 hours due to memory corruption whilst loading in Y2K fixes, and corrupted backups meant that it took the best part of 3 months to replay all the database changes from a known-good backup once dialtone was restored.
Various other things happened, but these were all things I experienced personally, and they were all reported on this website at the time.
My Y2K nightmare happened on 1/1/2001. It seems the billing engine on the so-called "intelligent network" device decided that 1/1/1 was a test day and started tossing billing records in the bin. Somewhere, some people had a great January, as we had no data to bill them against.
"So everyone not in IT thinks it was all a ruse by IT techs to get paids loads".
I didn't get paid an extra bean. They even chained up the door to the computer centre just in case the heating control system set fire to everything, that's how confident they were. (I did point out that this was the same type of heating control system that covered student halls, which were still occupied by a few overseas students over Christmas)
Just after midnight, I quickly checked the most critical systems, said "Meh, my stuff's fine, everything else can go whistle" and went back to partying like it wasn't 1999 any more.
Some sites turned EVERYTHING off, just in case. Some of them then discovered that leaving stuff powered down but still plugged in won't necessarily save it from a nearby lightning strike. Just when you think you've covered all possibilities in your disaster recovery plan, Mother Nature says "Hold my beer".
That'll be cyear from the ctime library with a hard-coded century tacked on. Somewhere there'll be a mouldering set of code comments to the effect that this needs reviewing before 2000.
I remember that, come Y2K, the fact that cyear is the number of years since 1900 turned out to be one of those things that a lot of people didn't know and just assumed it was the two-digit year...oops.
Most high-profile one I found on the glorious day was the US Naval Observatory's clock on the web merrily showing the year as 100...
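For anyone who hasn't tripped over it, here's a minimal Python sketch of the pitfall described above. It only simulates the C convention (struct tm's tm_year counts years since 1900; Python's own time module hands back the full year), and the two "naive" formats are illustrative assumptions about what the buggy code did, not anybody's actual source.

# A minimal sketch of the classic tm_year trap, assuming the C convention
# (struct tm's tm_year counts years since 1900; Python's time module exposes
# the full year, so the arithmetic here just simulates the old C behaviour).

def naive_displays(full_year):
    years_since_1900 = full_year - 1900   # what C's localtime() hands back in tm_year

    # Bug 1: treat the field as a two-digit year and bolt a century on the front.
    hardcoded_century = f"19{years_since_1900}"

    # Bug 2: print the field raw, as the USNO clock reportedly did.
    raw_field = str(years_since_1900)

    return hardcoded_century, raw_field

for y in (1999, 2000):
    century_bug, raw_bug = naive_displays(y)
    print(f"{y}: hard-coded century shows {century_bug!r}, raw field shows {raw_bug!r}")
    # 1999: '1999' and '99' - looks fine, so nobody notices.
    # 2000: '19100' and '100' - the infamous Y2K displays.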
As I tell prospective new clients "It's my job to ensure we see as little of each other as possible".
Related: I implemented a four hour minimum for on-site visits in (roughly) 1990, a couple years after I went solo. Double on weekends/holidays. A few clients balked at the new rate ... I simply told 'em "Don't call me unless you actually need me".
A new issue soon arose: Convincing 'em to pay 4 hours for a one minute visit. The old TV repairman's maxim applied, "I'm not charging you for thumping your TV with a screwdriver. I'm charging you for knowing where and how hard to thump your TV, and for showing up to do it". The explanation seems to have worked ... although about nine months ago, a child CEO wondered why I'd need to thump a TV with a screwdriver.
IT is many things, but it's rarely boring. "May you live in interesting times" may not be an actual old Chinese curse, but it's applicable anyway.
"IT is many things, but it's rarely boring."
Since you've been around the block a bit I'm finding this statement a bit difficult to swallow, mainly because the fuckwits *in charge* of IT teams are boringly predictable in their never-ending quest to do the wrong thing and ignore all advice to the contrary, because they apparently can't read an email explanation that contains more than one concept.
Bah, humbug!
"Another one was a bit smaller (still multi billion £ turnover 4k+ uk staff), no one in the business cared about the potential for disaster and after a year of planning to have the whole weekend from close of business ~ 7pm Friday to 6am Monday to do the work the organisation decided 4 days before the work started that they needed the network back on the Saturday afternoon."
I would have cancelled the whole migration about five minutes after being informed about that decision and rescheduled for a couple of months later, again a complete weekend.
Sorry, no. It's not normal at all. As the OP suggests it is actually incredibly bad planning. If either a PM or TM had been that lax on my watch, he would be out of a job for allowing such a basic oversight.
On operational day #1 after a major migration or implementation you should be in a Hypercare phase or similar, whatever you want to call it, where ALL key support and admin staff for critical applications are available on a rapid-response basis to quickly investigate all issues that arise that may prevent business operations continuing.
To allow the sole admin or support contact for (what sounds like) a mission-critical business application to be OoO on the first day after a major migration or implementation is a schoolboy resource-planning error that should have been caught by your pre-go-live checklist or G/NG meeting, and then captured as an issue to be impact-assessed and agreed with stakeholders at the final Y/N before the migration started.
It just sounds like shit planning, as the OP suggests.
Of course the real problem here is that there is only one person in the business able to support/administer the software.
We have a similar issue where I work, I am the only Dynamics Admin for our company and also the only Dynamics developer. So when I take time off there is no support for the product, I don't take work related calls on my days off so if something goes wrong it has to wait until after I get back to work.
It's nowhere near ideal but also not my fault. The board have, time and again, refused to hire a junior Dynamics person that I can train up.
Also, with projects like this, where the date often gets shifted, you can't expect people to constantly cancel and re-book things that they have planned and possibly already paid for just because the delivery date has slipped again.
This was common practice; usually the management team of a project would take a few weeks off around the migration date. They justified it because they had booked early for after the implementation date and then couldn't change when that date slipped.
If the implementation went smoothly, they came back to take credit for their good planning; if it didn't, they came back to rescue the company from others' incompetence.
When I worked for a large quango, someone noticed a pattern that the Director of IT would be on leave when the results of the salary review came out, so an enterprising DB admin wrote a DB trigger that would notify him, and thus everybody else, when the review was about to emerge, sorry, when the director was taking leave (always after 5pm on a Friday). So instead of doing a POETS day, we would all stay back.
He asked the guy responsible for the application about the error and got told it was nothing to worry about. Unless you have a particular reason not to trust that person, I can't see that he did ANYTHING wrong; you've got to remember that error/warning messages can quite often be rather unintelligible unless you know the software rather well. I do hope the application guy got a nice welcome when he got back to work, though...
"Internet not working - didn't pay the bill for 6 months"
I had an ISP for my home internet who didn't get paid for a few months because my card had expired. I eventually got a final and only warning. Was it too difficult for them to have emailed a warning the first time payment failed? Sometimes these companies can't look after their own interests as well as they might.
At one time we used SunAccount. They put a warning on every screen at login about a month before the licence was due for renewal. It played havoc with the screen-scraping program we used to keep account details in sync with the ordering system.
Yes, my father drummed it into me when I was growing up.
When you go somewhere new, the most important people on site are:
1. Security / Receptionist
2. The secretaries / PAs to the top level management
3. Site services
and then everyone else.
These days, I'd add BOFH high up on the list, but I'm biased, having graduated from the BOFH School back in the 80s.
Big D, I'd add the Admin staff to your list. PAs might be good for getting you an 'in' with their pet manager, but some of them have almost the same percentage of a clue as their manager!
Don't forget who orders the stationery, puts the holiday bookings on the system, can normally find an answer to ANY problem you have (even the ones that don't have anything to do with their job!), logs the faults when the printer/kettle/toilet stops working properly...
Anon because I've just done something that will cause my local admin team some grief and I want a chance to get the cakes in BEFORE they find out!
It's one of the first things you learn at uni when you go into post-grad education (and for the advanced player even undergrad education).
Forget professors and lecturers - if you get on the good side of secretaries, security, storesmen and technicians (including syshacks) you can get absolutely anything done and in an amazingly short time.
The best lesson I ever learned in education, one that's served me very well in my real life pretending to be gainfully employed (well I pretend to work and they pretend to pay me, so it balances!).
Edited to add: I see I wasn't the only one to learn that lesson. Nice one big_D!
"Forget professors and lecturers - if you get on the good side of secretaries, security, storesmen and technicians (including syshacks) you can get absolutely anything done and in an amazingly short time."
Being on the right side of a technician obtained us the superuser account for a Unix box to which (for no apparent reason) we only had access during limited hours. The box was standalone in a small lab and only we were using it so it was incredibly frustrating to be kicked out of our opsys project* work for a few hours in the middle of the day. From superuser a quick su would get us into our group accounts and we could carry on working.
*I wonder how many undergrads these days are expected to write a multi-tasking operating system?
"Forget professors and lecturers - if you get on the good side of secretaries, security, storesmen and technicians (including syshacks) you can get absolutely anything done and in an amazingly short time".
Oh hell yeah, and I bet this earned the Prof a "Laser Stare of Doom" for a long time afterwards !
(Icon, for a Prof reflecting on bad life choices)
Same here. Our department assistant had a secretary. While he was a nice guy and good with all the admin stuff he was responsible for, his knowledge paled in comparison to hers. She even called me when I was close to missing some exam deadlines. She knew every student, I believe, and saved the bacon of quite a few. They don't make them like this any more :(
Amen Brother Joe! Our university department secretary was a gem at taking care of (often pissy) students. Many times I would have quit out of frustration if not for her; I owe her my PhD.
I made up a walnut plaque inscribed "Surrogate Mother of the Year Award" that had scissors cutting red tape. All the other grad students jumped at the chance to chip in to buy a cake and throw a little party for her.
Here's to you Jean Sanford. May you be treated as well as you treated us students.
When I was at school (age 13) I was picked on by one nasty shit of a PE teacher. I remember telling my friend, whose dad (who happened to be the school janitor) overheard and said he would sort it out, which he did.
I never did learn what it was the janitor had seen the PE teacher doing that gave him the necessary leverage, but the PE teacher was practically human towards me from that day onwards.
I tend to disdain anyone who has a PA, but PAs themselves are very useful people to be friends with.
All too often it ends up being the PA that does the actual work that the manager gets paid for, which makes me wonder if the existence of the PA isn't just an example of the Peter Principle in action...
That's why the ladies at the front desk and the office manageress, the IT drones and their boss get a little something from me every Christmas.
Nothing significant, but *every* Christmas. I have never missed one. Because when "stuff" happens, these are the folks who are a big factor in getting whatever it is, "un-stuffed"
Also...Sysadmin Appreciation Day.
// Take care of the people who take care of you.
Many years ago I got called out to a site in, as I recall, Hammersmith in London.
Quite a long drag for me as a Midlands lad.
One of my colleagues had built the customer a shiny new Citrix farm. With an Evaluation copy of Windows NT4 TSE
Because he erroneously believed that you could simply add a license.
Which back then, you couldn't.
And it wasn't as graceful as the reboot every 24 hours. Oh no...this was a BSOD with a license violation error every 24 hours.
Cue a rebuild of the environment.
At least it was documented...lol of course it wasn't.
Unfortunately the guy who did the original build was a few days into a two-week holiday abroad, so he ducked that cluster-stuff nicely.
"Because he erroneously believed that you could simply add a license. Which back then, you couldn't."
As if you could today?
On the 90-day Enterprise Evaluation SKU, you simply could never add a license. The most you could do was extend the trial by another 90 days, up to two times (90 * 3 = 270 days in total). Not bad, as every Windows installation *needs* a reinstall after that anyway, what with the performance degradation and all that.
But add a license to go to the official Enterprise edition? No way. It's a different SKU.
ABSOLUTELY NO IDEA why that is.
The way I understand it, you are not supposed to be running a production environment on an evaluation license. However, real life frequently doesn't work like that. :)
SQL Server licensing, however, lets one swap from Express to Standard to Enterprise by re-running setup and performing an edition upgrade, which basically flips some bits around and kicks the database engine to activate the new features.
Some of the commentators on this thread obviously have little or no experience of large-scale migration projects. When you are migrating several hundred application services there are always errors present in the log files. The apps are often written in technologies you know little or nothing about, so you have to rely on the application admins to be honest and knowledgeable about whether reported errors are significant. I always used to ask if the errors had been present during operation a month ago; I soon got used to the answer "we never check the log files unless there is an outage". These days admins are often responsible for many systems, and with automated event handling many admins no longer know what "normal" looks like. In this case I place no blame on the poor chump running the migration, but lots on the PHB who let any application admins have leave during this period, and on the app admin himself for not investigating it further.
Yes! It seems that a commonly missed factor in migration planning is "what does the system look like now" - right down to inspecting log files for what "normal" really looks like. The gold standard is that a good system "runs clean" - you can explain EVERY entry in the logs, and there's nothing there you can't explain. Got a tiny little assert logged? Run it down. If you can't do that, then you had better know well in advance that these things are "normal" and that you shouldn't freak out after a migration when you see them again. It just cuts down on stress for everyone.
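In the spirit of the above, here's a rough, hedged sketch of how that baselining might look in Python. The file names and the normalisation rules are assumptions for illustration; the idea is simply to capture the message shapes that were already present before the migration, so that only genuinely new patterns need explaining afterwards.

# A rough sketch of baselining "normal" log noise before a migration, so that
# post-migration triage only has to explain the *new* message shapes.
# File names and the normalisation rules are illustrative assumptions.
import re
from collections import Counter

def normalise(line):
    """Collapse volatile bits (timestamps, hex ids, numbers) into placeholders."""
    line = re.sub(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}\S*", "<ts>", line)
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)
    line = re.sub(r"\d+", "<n>", line)
    return line.strip()

def pattern_counts(path):
    with open(path, errors="replace") as fh:
        return Counter(normalise(l) for l in fh if l.strip())

before = pattern_counts("app_before_migration.log")   # the "known normal" capture
after = pattern_counts("app_after_migration.log")     # first hours after cutover

new_patterns = {p: c for p, c in after.items() if p not in before}
for pattern, count in sorted(new_patterns.items(), key=lambda kv: -kv[1]):
    print(f"{count:6d}  {pattern}")   # anything printed here needs an explanation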
Our security guys decided we needed to close one hole in our cloud accounts, after some of them had already been compromised (and after years of us lowly admins whinging about changing non-expiring passwords on generic accounts etc, and other such basic account hygiene).
Cut to me running a script on 6000+ cloud accounts on FRIDAY AFTERNOON - most of the office, including techs, has already disappeared on leave - to enable MFA. I'm amazed I didn't get any phone calls over the weekend (fingers still crossed).
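Purely as a sketch of the shape of such a Friday-afternoon job (the commenter doesn't name the cloud platform, so the enable_mfa() call below is a hypothetical placeholder, not a real SDK function): dry-run first, log every account touched, and don't let one failure kill the whole batch.

# A hedged sketch of a bulk MFA rollout, not the actual script: enable_mfa()
# is a hypothetical placeholder for whatever "require MFA on this account"
# call your identity provider actually exposes, and the CSV layout is assumed.
import csv
import logging
import time

logging.basicConfig(filename="mfa_rollout.log", level=logging.INFO)

def enable_mfa(account_id):
    """Placeholder for the real provider API call (assumption, not a real SDK)."""
    raise NotImplementedError("wire this up to your identity provider")

def rollout(account_file, dry_run=True, pause=0.2):
    with open(account_file, newline="") as fh:
        accounts = [row["account_id"] for row in csv.DictReader(fh)]

    for account_id in accounts:
        if dry_run:
            logging.info("DRY RUN: would enable MFA on %s", account_id)
            continue
        try:
            enable_mfa(account_id)
            logging.info("Enabled MFA on %s", account_id)
        except Exception as exc:          # keep the batch moving past one bad account
            logging.error("Failed on %s: %s", account_id, exc)
        time.sleep(pause)                 # be gentle with the provider's API

# rollout("cloud_accounts.csv", dry_run=True)   # read the log before flipping to False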
Just finishing work for the week/year here and I have just been told of a major server upgrade that needs to be done*, and asked if that was "something you can do during your Christmas holidays".
* I can guarantee you that manglement have known about this for months but have only dropped it into my lap today.
That'd be on overtime rates, plus reimbursement of any costs incurred for cancellation of plans already made. Preferably with time off in lieu to cover the missed holiday, seeing as you're asking me to bail you out at short notice. Too expensive? Oh, then no can do, I'll be on holiday...
@short
I was laid off once for telling a manager that. I could prove that he held the project just to screw over a couple of us and our scheduled vacation, but it didn't matter.
Ended up getting a better job, so no harm no foul. I wouldn't mind having a little anonymous wall to wall counseling with said manager though.
> if that was "something you can do during your Christmas Holidays"
Sure, but as I'm on holiday, it will have to be rebooked - at your expense - and because of the short notice I expect penalty rates.
"* I can guarantee you that manglement have known about this for months but have only dropped it into my lap today."
When I was tecknishunning it was common for customers to put up with a comms fault all week and then drop it in our lap as an urgent fault last thing on Friday - cue lots of overtime and having to drive to a site amongst weekend traffic.
I have taken to starting and finishing early on Fridays (without telling people I don't trust).
That way, when they drop something in my lap at 5pm that would take 2 hours to sort it doesn't 'officially' get seen until Monday morning.
I say 'officially' because I would actually know at 5.01pm and just spend the rest of the weekend basking in the glow of a bastardly deed well done :)
I've had holidays booked in months where nothing was planned to go live; you can pretty much guarantee the dates will slip, slide and meander onto the dates I've booked off. The only way to plan a holiday and guarantee not to be in a go live where I work is to book it for the original planned release date.
"The only way to plan a holiday and guarantee not to be in a go live where I work is to book it for the original planned release date."
The very first time you really try that, Murphy kicks in and it will be the first time in company history an original planned release date is made. Just to top that, there will be a minor but crucial error in the released version which only shows up on production.
Even better, I booked my xmas leave in June, got told I was being silly "we're not in that week, it all goes to the Oz desk". Still booked it, still got approved. Suggested my team do it too.
October rolls around, turns out Oz is shitty about taking our calls (no surprise, they have nfi about our clients) so manglement tried to ask us to stay, then tried to revoke our leave, then tried to enforce our contracts.
Turns out HR fucked up the contracts (I'd pointed this out to them several times), so when they tried to hold us to their terms, we went work-to-rule.
We were an afterhours team. 1800-0200 in the week, 0700 - 1900 weekends. Our contracts said we worked M-F between 0700 and 1800.
So after accepting they couldn't force us under contract, manglement announced that, bugger rules and promises, we would be covering xmas.
So half the team (including myself) quit.
Guess what happened. Yup, now there wasn't enough team to cover xmas, Oz would have to do it.
But they did lose their supervisor and their two most skilled techs. So well played....
Incidentally, I did work there again, albeit at 3x the rate. New manager, who was quite happy to see me, since he knew bloody well why the previous manager had gotten the boot.
Time: 1991. I had just been laid off due to restructuring (i.e. the retail company had to find budget money to bring EDS in).
8 of us let go out of 30 or so.
The remaining crew had gotten out of being "on call" because "we are female and it's a bad part of town" or "we are working on the new systems and we shouldn't have to work on the old programs"... etc. etc., blah blah blah. Guess who remained on the on-call list - yup, the 8 of us that got laid off.
Anyway, the layoffs occurred at 4PM on a Friday.
On the following Tuesday morning at 2 AM I got a call from operations: "We got an error on xyz job, you need to come in and fix it".
Me: no can do, been laid off.
Them: Well I guess we can call Joe then.
ME: Nope, he got laid off too.
Them: What should we do?
Me: Call the bosses' boss and tell them no one is on call anymore ...
Looking back, I should have done the standard SOP: "HEY, restore all the backups from 5PM yesterday and I will be in to fix things", then rolled over and gone back to sleep, and when they called again, apologised - "oh, I must have dozed off" - and kept the ruse up till the remaining crew showed up in the morning. But I didn't.
Live and learn I guess ....
Reminds me of the time I was working with a colleague to help out with a problem in a software package. Going through what he had done before, step by step, an error message reading "XYZ failed - click OK to recklessly proceed anyway" popped up. "Ah," said I, "I think we should look into that," as my colleague reached to click "OK". "Oh, don't worry, that's not the problem," said said colleague with absolute assurance. What is gobsmacking about this is that he knew (that is the reason I was there) that I was the main author of the software!
Shortly after I started work for a medium-sized company as their "IT guy," our web site and email went down. If you guessed it was the company domain name had expired, you're right.
Although it's somewhat of a lame excuse, I really hadn't been there long enough to start looking into all the infrastructure details[1], and my predecessor had helpfully set all the contact information on the domain to his own company email account, which he deleted before he left, so nobody got any warnings. At least it was an easy fix. I reset the contact info to go to both myself and my boss, the controller.
[1] It took me several days just to clean up my new work computer. It was good hardware, but my predecessor had a porn collection to rival the PFY's, categorized and all (including "BESTIALITY"). He also had a browser hijacker and other crap on there. I couldn't just wipe and re-install the computer because they didn't have the install CDs for all the software(!) It took me a couple of days just to get that computer sorted so I could start looking into the many other issues.
Never ever use personal email addresses for things like that. Use an administrative name that is really a distribution list, and have some paperwork somewhere, in big crayon, that ensures several people know about it, all of whom can be attached to the arse-kicking should it fail.
"Never ever use personal email addresses for things like that. Use an administrative name that is really a distribution list."
I have come across more and more vendors and services, such as SSL certificates, that no longer allow that. You have to resort to using an email address that LOOKS like a personal address even though it is a distribution list.
With the coming of browser checks for https certificates, and warnings when they are not right / expired / chain broken worrying users who don't understand these things, why do people still insist on migrating their sites to https when they have no data traffic worth encrypting? Don't they realise they are creating huge maintenance headaches for themselves? Er, that'll be a "no".
Re. being laid off and still being needed to Fix Stuff, the old joke is:
"There was an engineer who had an exceptional gift for fixing all things. After serving his company loyally for over 30 years, he happily retired.
Some time later the company contacted him regarding a seemingly impossible problem they were having with one of their multi-million dollar machines. They had tried everything and everyone else to get the machine fixed, but to no avail. In desperation, they called on the retired engineer who had solved so many of their problems in the past. The engineer reluctantly took the challenge.
He spent a day studying the issue. At the end of the day, he marked a small "x" in chalk on a particular component of the machine and proudly stated, "That's where your problem is". The part was replaced and the machine worked perfectly again. The company received a bill for £50,000 from the engineer for his service. They thought this was steep and demanded an itemized accounting of his charges.
The engineer responded with the following account:
Chalk: £1
Knowing where to put it: £49,999
No end of times I get this.
User " An error popped up".
Me "Ok, What did it say"
User " I can't remember"
Me "What were you doing at the time"
User "I can't Remember"
Me "When did it last happen"
User "Months ago" (even though you are there on an unrelated matter.
or
User " I have a message here what do I do"
Me "Ok what is it saying"
User "Well its just asking me to click OK"
Me "Anything Else"
User "No, Just ok"
Me "Well you don't have much choice really than to just click OK."