Experience is the best teacher
We all know that proper Change Control is Super-Duper-Important, and yet I'd wager that nearly every reader here has felt that same "bowel-loosening realisation" at least once in their career...
Today's edition of Who, Me? is a sorry tale of credit where it most certainly is not due. Our story takes us back several years to when our hero, helpfully Regomised as "Henry", had taken a job working at a large, multinational bank. Henry had been employed on the networking side of the institution and, keen to impress his …
Yes, but it's probably the other way round, no?
"I'd wager that nearly every reader here has felt that same "bowel-loosening realisation" at least once in their career..."
which in turn led to:
"We all know that proper Change Control is Super-Duper-Important"
Intelligence = learning from your own mistakes
Wisdom = learning from others' mistakes
Let's just say that I likely am intelligent. Not wise.
"Intelligence is knowing a tomato is a fruit. Wisdom is knowing not to put one in a fruit salad."
In computer terms it might be expressed as "Intelligence is knowing where to whack a case to make it work again, Wisdom is knowing why to whack it there."
*Hands you a pint*
Drink up, it's cheaper than going to the psychologist. =-D
""Intelligence is knowing a tomato is a fruit. Wisdom is knowing not to put one in a fruit salad.""
Intelligence is knowing that a watermelon is a fruit. Wisdom is knowing it is actually a cucumber, and works well when sautéed with chicken (or pork chops) and served in a chipotle tomato sauce.
Hint: Get a melon as red and sweet as you can find. Use a melon baller for the watermelon. Don't warn your diners. They will bite into one of the balls, expecting a cherry tomato. The look on their face is priceless ... as is the empty plate soon after :-)
--safe-updates, which stops you doing an UPDATE/DELETE statement without a WHERE clause; this got added to all the client my.cnf files on our servers after someone messed up a production DB. Amusingly, in the pre-Sun days, it used to be called …
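For anyone who wants the same guard rail, the option can be switched on per client in my.cnf (a minimal sketch; the [mysql] group applies to the command-line client only):

```ini
# Client-side my.cnf fragment: applies to the mysql command-line client
[mysql]
# Refuse UPDATE/DELETE statements that have no WHERE clause (or LIMIT)
safe-updates
```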
ok, I'll quickly delete the access on the training environment for all the non-admin profiles....
DELETE FROM USERS WHERE WORKGROUP <> 'ADMIN'
6424 records deleted
oh sh!t, oh sh!t - I'm in Production.....
ring, ring - hello Ops? Load the back-up tape from last night please please please please.....
Lesson learned. Always run a select before a delete to see what data you're about to mess with!
And for any operation that will change data, you should have started that with a BEGIN TRANSACTION.
That way, if you screw it up, you only have to deal with the database locking it will have created from moving all the records in the table into the log and back again when you do a ROLLBACK.
Of course, if you want to empty a massive table, you might want to forego the transactions and use a TRUNCATE to avoid the log file bloat. You'd better make sure you have your steel-reinforced underpants on first though before going anywhere near a TRUNCATE command (on a production server or otherwise, because of that situation where you thought it was the test environment but it turned out not to be).
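The select-first and transaction advice above can be sketched with Python's built-in sqlite3 module (an illustrative toy, not anyone's real schema; the table and column names are made up to echo the story):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # toy stand-in for the real server
conn.isolation_level = None          # autocommit; we manage transactions by hand
conn.execute("CREATE TABLE users (name TEXT, workgroup TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "ADMIN"), ("bob", "SALES"), ("carol", "SALES")])

# 1. Run the SELECT first: see what the WHERE clause actually matches.
(doomed,) = conn.execute(
    "SELECT COUNT(*) FROM users WHERE workgroup <> 'ADMIN'").fetchone()
print("rows that would be deleted:", doomed)

# 2. Wrap the destructive statement in an explicit transaction.
conn.execute("BEGIN")
conn.execute("DELETE FROM users WHERE workgroup <> 'ADMIN'")
# ...the bowel-loosening realisation that this is production...
conn.execute("ROLLBACK")

(remaining,) = conn.execute("SELECT COUNT(*) FROM users").fetchone()
print("rows after rollback:", remaining)   # all three rows survive
```

Had the ROLLBACK been a COMMIT, the two non-admin rows would be gone; the point is that until you commit, the phone call to Ops is still avoidable.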
Trouble is that "proper change control" invariably ends up implemented as "everything has to be signed off on by this bunch of rabid Nazis who find their phone a technical challenge".
Everything takes longer, deadlines get missed because the wrong font was used in the change submission, critical fixes cannot be deployed because CC won't provide an exemption for something they don't understand and never will.
So many times this.
Change control boards structured to include the entire management tree of the networks division, plus the one poor operator volunteered to be at the meeting this week - none of whom have the slightest clue whether the proposed change(s) to software will actually do what it proposes to do, and none of whom will actually be involved in implementing said changes. "Should we have someone from the software development side involved in these meetings?" "No chance! They'll just sit there complaining the whole time and make the meeting take longer!"
I'm a PM who implements complex IT systems, often with technology which is new to the organisation. I have lost count of the number of times I've been warned that the change board / infosec team / Service Desk / networking manager were obstructive and would delay my changes.
A quick meeting to introduce myself and let them know what experience I have, and that I come from a technical / ops background, would usually help in the first place. But then submitting complete change requests, with an accurate impact assessment on failure and a well thought out back out / recovery plan, with confirmation that the appropriate technical resources / supplier support were available for both the change and the backout if required, gets me through the change board in one visit. This includes regulated industries, local and central government and the NHS.
If you are taking the time to plan a change well enough for it to work and not risk unplanned outages, then you've already done all the work, and pasting it into a change template takes 10 minutes. If you can't be arsed to complete the change documentation, or don't show some professional respect for your change management / infosec colleagues, then don't be surprised if they make you submit form after form until you get it right. To be fair, dolts who don't realize this make my life so much easier, as my changes get approved on a first visit while you get to play another round.
I agree with the above AC.
I've always tried to make my changes as complete as possible, and include the possible impact and the backout plan, and in general I've rarely had a problem with getting my changes through the change board.
Most of the time, the purpose of the change board is to make sure that adequate thought and planning has gone into the change, and if you achieve a good track record for accurately describing and implementing the change, and being able to cope when problems occur within the description of the backout process, then you generally have a relatively easy ride through the board. If problems happen that weren't anticipated, then there will need to be a wash-up afterwards, because it probably means that something wasn't considered.
If a change needs further scrutiny during the board, it probably means that you've either overlooked something, or that it is affecting something of critical importance, and people want to be absolutely sure that everything has been covered. These are valid outcomes of the board, although you may see it as obstructive if it prevents you doing the change. In that case, work harder to make sure it doesn't happen again.
This can come down to being a matter of trust, and you have to earn that trust.
"Most of the time, the purpose of the change board is to make sure that adequate thought and planning has gone into the change"
Just like HOAs, a change board frequently gets taken over by someone who likes "power" and becomes difficult for the simple sake of being difficult.
OTOH - try getting through to management that the system was designed to set up new products through a proper user interface that stitched everything together properly, and not via a list of instructions for the DBAs to execute individual SQL statements that required CC clearance.
On the other hand, just this morning, we had a change where someone who wasn't especially familiar with it questioned a couple of bits. And they were right - a couple of data fields were, in fact, going to be set wrong. While CC takes longer, it does catch errors.
Yep. Due to some things flying under the radar and never being given the time of day, some vacant fields get appropriated by others and put to good use, only to be trashed by the department that didn't have the time spare to allocate the fields in the first place, leaving it to the users to accomplish for themselves.
Hence it is always useful to include the plebs in some discussions, so you at least understand how the real world is managing to work around your idealistic conception of what the real world is.
>>Trouble is that "proper change control" invariably ends up implemented as "everything has to be signed off on by this bunch of rabid Nazis who find their phone a technical challenge".
Many years ago a previously excellent network operator we purchased multiple circuits from suddenly started having a lot of large-scale outages. I asked the guys there what the heck was going on; they told me that management had implemented impassable change controls, but changes could still be done without authorization when fixing network issues, and their work wasn't going to do itself.
I was once part of a rapid development and deployment team in a company desperately needing to recover at least 20 years of progress. We were fantastic, we knew. Wow, we just got in there and got things done. We were transforming the company's IT estate. We each had our specialisms and trusted each other implicitly. It was a great time to be part of a team. Except...
We had one team member who just got in the way. He'd want to stop to check we'd thought something through, or wondered whether our changes would affect XYZ system in ABC department. In short, he was infuriating, and got up all our noses.
Then, as part of the company's attempts to meet the 21st century, we all had to go on those dreaded team building days. Almost to a team-member, we rebelled at this. What a waste of our time when we could be going in there, doing the good we were paid to do. We knew each other's strengths after all.
And at that team building day, to our surprise, we learnt just how often our apparently stuck-in-the-mud analyst team member was actually saving our bacon. After that, we listened a lot harder when he contributed his part, no longer hearing the "over my dead body" we'd been hearing, but rather thanking him when he stopped us doing the many wrong things we really were doing.
I've never got over my aversion to team building days etc, but I have to say, this one helped no end. I hope I've been more appreciative of the "thou shalt not" brigade since. They're there for a reason, and the reason can be career-saving. Thanks, Chris, wherever you are now.
At one place we had printers with Windows print queues, the users had a production system for tracking & creating the paperwork for the shipping of meat to the neighbours to the south.
The paperwork did not get pushed out through the Windows print queues, but to the same printers tied to their MAC address.
However, this one location wouldn't work like that (Legacy etc) & was tied to the IP address instead, leading to frequent calls to me that someone on "Nights" in Delhi (Icon - Coz only Lager can kill a curry.) had corrected the "mistake" again & the meat couldn't be shipped due to lack of paperwork coming out of the printers.
Back in the early days, it was fairly common for me to arrive on site to troubleshoot, pull up the logs, and be asked with shock (and sometimes more than a little awe) "You can DO that? WOW!" ... The same type of people were even more flabbergasted when we could do it remotely. It was all magic as far as they were concerned.
A lot of the banking codes that a bank has to abide by mandate a level of auditing that should have picked up the unauthorized change to the compression setting (or as an absolute minimum, who had logged on with time and date info and who had privilege to make such a change). Banks have to have full audit tracking of changes to the infrastructure, and in fact, the one that I worked for kept full records of all of the activity of all of the administrators.
If this one did not, then there is a definite deficiency in their processes.
I suppose that a lot depends on how old the story is. If it happened this century, I think that 'Henry' was exceptionally lucky not to have been dismissed once the logs had been reviewed.
Given the language used to describe the technology used it sounds late 20th century.
Compression on comms links has been around for a while and is generally derided by all. Source coding compression is far better and has been used since before the days of Voyager.
I think Riverbed were well known for their WAN compression systems, and Cisco also had a go in the early 21st century; both were widely derided.
Dial-up IP modems used to have compression features available too, which failed on noisy lines and obviously required the capability at both ends. You can only compress data so far; there is little to no gain compressing already compressed data like encoded audio and video.
We’ve got tonnes of documentation; finding it is difficult, and knowing it applies to what you’re looking at is harder.
I often come across undocumented parts of the network - traffic traversing an undocumented subnet with no detail of which L3 switch the unrouted VLAN is on. Really unhelpful when you need to amend the access list you don’t know about, on that VLAN you don’t know is there.
When one’s section head of technology (they could not call him Section Head of IT for obvious reasons) has decided off his own bat to conduct a test of our new Cat 6 cabling by telling staff to pull each cable in the hub to check that it is firmly locked into position.
2 hours later, as his boss, I get to take all the credit for fixing the various outages by creating a very long memo to the effect that we followed up reports of ‘stretching’ in IT network cabling by testing each cable, and I even got someone working at the IT desk in a non existent department outside Cheltenham to confirm this.
When I was young, I always thought the first time was an accident, or someone else's fault, or something. In those days only after the second "event" would I figure out, it was all me. So I wouldn't do it a third time. As an adult, I figured out the first time, yeah that one was mine. So I wouldn't do it a second time. And I concluded, the value of experience is missing out on the second time.
I had to go on site to help resolve a customer problem. I sat alongside the customer and helped him configure the product, and it was going well. Then I noticed we had a third person in the booth; someone had quietly snuck in while we were busy. It was Jo, the duty manager. We had a nice little chat, and I explained why I was there, and what we were trying to do. All very amicable. Her parting shot was that we had popped onto the radar because they had been notified that "unusual" commands were being executed on the test system. The commands were locked down in production - but not on test, they just produced an alert if they were used.
Without any fuss, this reminded the team that unusual activity was being monitored, and that it would be nice to tell the operations people if there were any unusual activities being planned.
I sometimes get engaged in troubleshooting or performance tuning for customers. Review their configuration, spot all the out of the box defaults that weren't changed after the initial install, make a change request and wait for however long for it to be actioned - hopefully less than 24hrs. One time I spotted something that had been changed from the default setting, but not to something that I thought normal. That's odd, I mused. Somewhat outside the remit of the task I had been asked to perform, but I changed it to what I would normally recommend - who wouldn't want that improvement, even if they hadn't asked? Of course that single modification caused havoc once the 24 hr change window had completed. Oh crap. Hastily come up with a "slight incompatibility issue" excuse, obviously tweak some inconsequential parameters and surreptitiously change back the broken one. Try to get emergency change control initiated.
At the other end of the scale I was at one customer and made config recommendations, with the caveat that they need to be proven in a test environment etc. before being implemented in production. I saw the DBA edit a config file and expected it to be the test system or a change to production that would be implemented after testing and change control had agreed. But being a "real" DBA, he had made the change to the production system and then restarted the database server, without any warning, in the middle of the day. There were about 500 sessions connected at the time. Phone starts ringing. He ignores it - "no problem, they'll just get a coffee and reconnect".
> Phone starts ringing. He ignores it - "no problem, they'll just get a coffee and reconnect".
If the client software can't handle a disconnection, then it's not enterprise grade anyway
(My reasoning on this: once you start using resilient/distributed databases, server switches will cause client drops anyway. The software has to be able to cope with it happening)
There is a shitload of "Enterprise grade" software out there which is fragile as eggshells and shouldn't be there
There are also a number of "Enterprise" vendors who will respond to griping about this behaviour (or detailing the issues on review sites) by raising complaints with your employer instead of actually FIXING the brokenness
It's trivial to handle short disconnects by a retry, but how long should the software wait before giving up, and how should that be handled? It's hard to be graceful when your database suddenly isn't there, especially if the software is in the middle of some complicated dance between several different things that need to be coordinated. The sort of things where you're trying to commit the result of an operation to several databases to update them to say you're finished, but the last one goes down as you're committing and doesn't respond with whether the transaction was completed. Do you reverse everything else you just committed on the other data sources and risk the DB coming back up with a completed status when it has actually been rolled back? Or do the opposite and risk an operation being duplicated?
Sometimes failing gracefully isn't graceful...
"If the client software can't handle a disconnection, then it's not enterprise grade anyway (My reasoning on this: once you start using resiliant/distributed databases, server switches will cause client drops anyway."
I think this reasoning is flawed. If your database gets transferred to a different node in a distributed database, you'll lose connection very briefly because your new node is already available even if the previous one isn't. A simple DB down, retry, connect, and the system recovers. At worst, the client has to store a log of the stuff the previous node might or might not have done to check, not as bad if the operations are idempotent.
When the database isn't distributed, that doesn't happen. The DB down happens, but the retry doesn't immediately turn up an alternative. What should the client do? This could be the database went down. Or the network went down. Or the network cable came loose from the computer and the connection isn't going to get fixed until the user puts it back. The client can easily cache the operation the user wanted to perform and have it ready to resume when the database comes back, but depending on what happened, that might not be a good option. If it was a problem at the user's end and nothing happened for an hour while it got fixed, repeating an hour-old operation might be a problem if other data has changed in the meantime. The client has to do something, and it's unlikely it can guess at the users' intent all of the time. Best in those cases to just tell them the database can't be reached, invite to troubleshoot, and not to start doing things automatically when it comes back. As long as it doesn't crash, I don't think it's fragile.
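The "trivial to handle short disconnects by a retry" half of the argument can be sketched like this (an illustrative toy, not any particular driver's API; it assumes transient failures surface as ConnectionError, and that after the last attempt the error is handed back to the caller rather than guessed at):

```python
import time

def with_retries(op, *, attempts=5, base_delay=0.5, transient=(ConnectionError,)):
    """Run op(); on a transient failure, retry with exponential backoff.

    After the final attempt, re-raise: the client should surface the
    failure rather than keep guessing at the user's intent.
    """
    for attempt in range(attempts):
        try:
            return op()
        except transient:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Toy stand-in for a query that fails twice and then succeeds,
# e.g. while a distributed database fails over to another node.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("db node unavailable")
    return "rows"

print(with_retries(flaky_query, base_delay=0.01))  # succeeds on the third try
```

The hard part the comments above describe - an in-doubt commit during the final attempt - is deliberately out of scope here: no amount of retrying resolves a transaction whose outcome is unknown, which is why that case needs idempotent operations or a reconciliation step instead.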
I remember installing DECnet S/W on an HP-UX box to communicate with a VAX. DECnet packet addressing was based on the assumption that all the NICs were DEC and so had the same MSBs in the MAC. In order to work, the S/W changed the HP MAC to look like a DEC one, with no warning. This rendered all the users' caches invalid and broke their connections. Fortunately this was in the days of character-based applications with client and RDBMS running on the same box, so there were no database connections broken and the user PCs caught up with the change of address pretty quickly.
Some years ago I was looking at a minor glitch on a customer firewall, located 300 miles away in a secure datacentre. It was a Cisco ASA of some variety and dozens of small branch sites connected back to the main corporate network through it. I can only assume I was overworked or hadn't been getting enough sleep because I enabled verbose debugging to my console, but for all protocols; the thing almost immediately locked up under the strain and flashing red signs started to appear all over the big NASA-style status monitor on the office wall. Cue the afore-mentioned moment of horror. The boss ran in to the office in full panic mode as I was thinking about clearing my desk and updating the old CV, but then I realised - there was no external logging for the firewall so the only trace of what I'd done was in the running config. "I know, I've just seen the outage! Calling the datacentre now!" I rang the site and had them power-cycle the device, 5 minutes later everything was happily reconnecting and I was taking the credit for my calm and prompt response. I never owned up to it.
Was once given a 4 month gig at a multinational Insurance company. Back then I was the LAN/WAN Man. After careful network diagnostics (and a lot of bemusement) I came to the bizarre conclusion that having 700+ NT4 workstations on the same flat network as the servers was a mildly bad idea.
Having shared my earthshattering revelation I was told "We know, we're waiting for Change Control". An interesting 4 months trying to look busy and spend as much time in the pub as possible.
I know what you mean. I once got a gig in a very large administration, only to be told a month later that all changes were frozen.
I spent 18 months trying to find stuff to say at a weekly department meeting when they knew perfectly well I was sitting there doing nothing !
What I would love to hear about is people who caused outages that actually brought down the company, never to return. Quite often you hear of outages that cost a company enormous amounts of money and productivity. However, they never seem to get wound up; they survive. Probably because they make up what was lost and people do double time to recover the lost productivity. So these things are quite often portrayed as terrible events, yet ultimately they never amount to much more than a blip. Is there a 'brought down the network, brought down the company' kinda story out there?
I deployed a networked software solution that chopped several hours off a 27 hour day, and thus this became a popular product. My deputy at the time ran an embryonic intranet server, and I used this to document everything including the fact that the compression button broke everything and should never be used. In this documentation I built in step-by-step playbooks for all the problems I had encountered including the "transfers hang = compression turned on" one.
My deputy claimed I never showed him anything, played his face to his manager (we had different ones due to some truly imaginative org trees) who had me fired off the project on a trumped up "annoys the users" charge. To stop me from walking they gave me Mr Backstab's web server to manage along with all the other stuff I did anyway.
Two weeks in I walk in and the network techs beg me to take a look. Mr BS was out somewhere and nothing was working across a networked enterprise the size of England. I said I wasn't allowed to touch it, but the big boss came in and said "just do it" so I did. I walked into the network room and under the eyes of the network techs I unchecked the "compression" box, and everything woke up.
Then I went into the Big Boss's office and made him fire up the intranet site for the product.
"Go to the problem page. It's the fourth hyperlink in the four member list"
"Now find the symptom we were seeing and click on that"
"Now read off the fix"
"Uncheck the compression option"
"Now tell me I never showed Mr BS how to do anything."
This hero presided over several more uckfups with that product which Those In Charge asked me to fix, including deploying a new version of Unix untested (and so not knowing about the added step of authorizing the printer so it could be reached from a PC, locking the entire thing solid in front of some VIPs) and bricking half of the training room's PCs two weeks after loudly acquiring a Microsoft Certification.
He can't look me in the face to this day. Totally worth it.