Not waving, but drowning
in observability. The sidecars are running full steam, the metrics are flowing. They all break at the same time in slightly different ways. Which is chicken and which is egg? SNOWbody knows...
In a multicloud world in which IT teams are asked to manage their own infrastructure and apps as well as help out when lines of business buy and then break their own tech, ServiceNow thinks you might wish to consider shrinking Mean Time To Innocence (MTTI) – the amount of time it takes to prove that a problem isn't your fault …
I like that term... I'm always recording my MTTI - well, in some cases my personal MTTI runs to months or even years, but that's mostly because of a few stubborn outlying cases.
Like spending 18 months proving that an outside MSP were a shower - literally defrauding us with impossible product claims, breaking things we hadn't touched in years and then blaming us for it, interfering with deep-level networking that they didn't understand, even actively kicking one of my guys off a remote desktop session on a server and then *restoring an ancient VM checkpoint* without permission in the few minutes it took us to work out why we couldn't get back into that server and why things were falling over. The server jumped back months in time because they didn't know the difference between "Delete Checkpoint" and "Apply Checkpoint". Again, I had to piece every little bit of that together from logs to prove our innocence.
Despite full documentation and contact methods, they stomped over the IP of a critical iSCSI storage device on a protected VLAN in the middle of the working day... to install a test VM to try to prove that a product they'd sold us would work on Linux VMs (and we hadn't bought that software yet, and they were spinning up a bare Ubuntu machine to test it, so hardly critical functionality!). Again, no notification or change management, just our storage collapsing and nobody admitting fault until I provided logs showing exactly what happened. Oh, and they never got that product working on Linux VMs (P.S. Avoid "Datto" backups like the damn plague... such a bodge-job of a backup platform!). They did exactly the same a few weeks later for another device and knocked out the boss's main printer as it fought to keep hold of its static IP that was being stomped over, and we initially got the blame for that too.
It wasn't even like we were in opposition - we were the long-established team, we were NOT going to be made redundant or anything; the MSP were brought in because we were short-staffed, but they could never provide any on-site personnel and the job was very hands-on. We had full documentation and change management, and these outsiders were given remote access AND insisted on having all our docs (and even then rewrote them into *their* standard format for consistency with the rest of their customers, which took weeks!) and then completely ignored them all. They sold us products that even the manufacturers had warned them, multiple times, would not work the way they wanted (I know, because I bypassed the MSP and spoke to the manufacturer directly, and they basically said "Oh... you're the end customer - yes, we specifically warned your MSP about this and they ignored us, please don't blame us! I'll give you that in writing if you need it."), and one of the devices they sold us literally sat unconfigured for a year because they were supposed to get it working. It was installed in our rack. Not cabled. Never turned on. (And we know, because WE controlled access to the racks.) Because it simply wasn't compatible.
They removed a perfectly functional and secure IPSec VPN and replaced it with an almost identical "zero config" VPN device at great expense. Except that they couldn't then configure it correctly, so access control and CCTV at our secondary site literally never worked properly while it was in place - because it was reliant on routing rules that they were too dumb to be able to implement. Additionally, the irony was that the devices they put in still had to sit behind the very routers that were ALREADY providing the original IPSec VPN. So, in fact, all we did was turn off a working VPN and then put two expensive boxes in addition to that existing equipment, configured in a way that couldn't work. That all got ripped out when they left and put back how it was.
Hell, they charged us extra to "pre-configure" a server, which turned out to mean "they put the default Windows ISO on it". They never actually LICENSED it or activated it, though. So when the activation timebomb hit - and we'd had nothing to do with that device - the server just fell over. Again, in the middle of the working day. And they'd been the ones to move several critical services over to it without asking. Their response to that was pathetic, too, and still they managed to cling on. The next week, the on-board network cards on it failed; they were entirely useless, couldn't diagnose it, and refused to send anyone to site for at least 24 hours. Again, middle of the working day, critical services down. Again, someone hand-waved it away while screaming at US when it was nothing to do with us.
It took 18 months to build a huge body of evidence against them - far beyond what should have been necessary - through dozens of incidents of downtime, enormous arguments and confrontations, even accusations of "obstruction" aimed at us (which I'd guarded against in advance, because I'm no fool: that particular project was entirely hands-off on our end, with simple end-goals which they themselves had confirmed were their responsibility and a "simple job"; they were provided with everything necessary and given many days of supervised assistance, with my team on hand to help but recording everything they were asked to do in case of reprisals later). They couldn't get it working, went crying to senior management, and tried to blame us for failing to live up to their own promises.
It eventually took half a dozen IT experts - some independent, but also at least three who were personal friends of the big bosses - all of whom immediately agreed with us and not the MSP, before we ever got any movement on it. And then it took the CEO of the MSP literally YELLING at me down a Teams call... which the bosses were all able to view... to actually get rid of them.
I had warned my employer that MSPs and in-house teams DO NOT MIX, no matter what contractual boundaries you're supposed to have put in place, but especially if nobody is managing those boundaries.
And eventually that workplace discovered that all the things *they'd* been told in secret meetings with that MSP were lies, and finally admitted as much to us. To the point of us being instructed: "Well, if we got rid of them, could you get it working again with what you have?". We could. They did. So we did. In a fraction of the time and manpower the MSP had already burned trying. And the response from senior management when we presented the working system: "Okay, so they absolutely were lying to us all along then. They said it wasn't possible.". They had been contracted to do basically anything necessary if we were unavailable; they were supposed to be our path of escalation, our substitute, and to "mentor" us on how to properly configure servers and services (haha!), so it was all entirely within their remit - and they couldn't do the simplest of things.
I stuck around purely to prove my innocence (I do often treat that part as a game, I have to say, because I *know* I'm not screwing anyone over, and the reasons I give at the outset are exactly what we end up circling back to; it's just a question of how long that "MTTI" actually takes). Then, once we were done and dusted and I'd proven my case, I took the first job offer that came my way. Incredibly, that workplace swore off any MSP use and hired only on-site staff again to replace me, which I wasn't expecting but was at least the right thing to do!
My MTTI took a bashing, but not because of me. There are still no incidents of "non-innocence" on my part to impact it, though, even after 20+ years in the job. So it'll return to average over time. Generally a few weeks, in fact.
Was surprised this article was neither about Trump, who always seems to be claiming his innocence and is decidedly mean (or is that "cheap" in en_US?), nor even the meanness of the "paedoguy"-sledging Musk, whose innocence required a dubious judicial decision.
Liz Holmes is likely looking at an MTTI of 11 years, unless Trump pardons that dazed bunny during his next administration.
The whole ServiceNow pitch triggered my bullshit detector - I suspect it's a right load of codswallop designed to relieve gullible organizations of even more of their hard-earned. Who are you going to call to shrink your MTTI when dealing with ServiceNow?
As far as I can see the whole problem is fundamentally structural. Once an organization devolves critical infrastructure to mercenaries it can only really expect to lie back and think of England.
When your dashboard reports a problem, how do you know whether the problem is with the monitored system, or with the dashboard system?
I've seen/had to use quite a few flaky dashboard and/or device-administration programs. Otherwise-working devices not showing up on the dashboard, or not responding to commands from the device-administration program, were common failure modes. Yet I could print to those printers, or access the C$ admin share of those PCs.
When the crime is effectively still in progress, everyone is guilty until their alibi checks out, one culprit is positively identified, or the outage ends.
I've been on enough of those calls to really like the term MTTI, but also to figure that one more pane of glass (ServiceNow or not) isn't likely to reduce my own MTTI for calls like that. Anyway, my value on those calls usually extends beyond proving my (org's/component's) innocence to guiding (sometimes herding) the combined troubleshooting effort toward nailing down the true root cause and potential mitigations and fixes.
This kind of job is among the next ones to be replaced by an AI :-/
I'm happy for someone else (or something else) to do it - I've plenty of other demands on my time that my employer is happy to have me meet. And if AI starts taking over the rest (or even the majority) of my job-related functions, well, we're all a lot closer to utopia or dystopia than I'd anticipated, and either way I'm going to sit back and enjoy the ride.
I heard about the guy whose code had the highest number of abends/sigv. He was leaned on to "fix it", even though he said his code was innocent.
To please the bean counters, he added code to value-check every parameter passed to his routine (which doubled the path length). Anything that failed got passed back an "error in parameters" return code.
Suddenly his code was getting no problems, but everyone else's code was reporting "error in parameters".
He compromised by providing two versions of the module: one for "friends" who passed in good data, and a "general" one for everyone else.
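For what it's worth, a minimal sketch of how that trick might look - the names, the RC values and the 4 KB cap are all invented for illustration:

```c
#include <stddef.h>

#define RC_OK         0
#define RC_BAD_PARAMS 8   /* the "error in parameters" return code */

/* The real work, shared by both entry points (stand-in body). */
static int do_work(const char *buf, size_t len, int *out)
{
    (void)buf;                /* the real routine would actually parse this */
    *out = (int)len;          /* placeholder for whatever it really computed */
    return RC_OK;
}

/* "General" entry point: every parameter is value-checked first, which is
 * what doubled the path length. Bad input never abends here; it is handed
 * straight back as RC_BAD_PARAMS, so the failure lands in the caller's
 * statistics instead. */
int process_record(const char *buf, size_t len, int *out)
{
    if (buf == NULL || out == NULL || len == 0 || len > 4096)
        return RC_BAD_PARAMS;
    return do_work(buf, len, out);
}

/* "Friends" entry point: trusted callers who pass good data skip the
 * checks and keep the original, shorter path length. */
int process_record_trusted(const char *buf, size_t len, int *out)
{
    return do_work(buf, len, out);
}
```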
> Mean Time To Innocence (MTTI) – the amount of time it takes to prove that a problem isn't your fault.
I generally find that as a first approximation, blaming a problem on the network usually works. If they somehow manage to wriggle out of it, then laying the blame at the door of whichever software back-ends the problem application is a good second bet.
"Our monitoring board isn't showing any issues so it can't be our fault".
I wonder why that Mean Time To Innocence is considered to be useful at all.
If you provide a service to me and I'm paying for a support contract, I expect you to work with me and my other suppliers to help figure out what is going wrong, even if the fault is not in the area you are responsible for. The priority should be to fix the issue - figuring out who to blame (and maybe bill) can wait until that is done.
If you make me prove the issue is in your area of responsibility before you'll even look into it, why am I paying for a support contract?
If it is your fault, you have to fix it at your own cost anyway.
I agree, and have never had a problem staying on a call to assist with testing and troubleshooting an outage. As far as I'm concerned, my job isn't done until the circuit is up.
The only exception to this is certain customers who refuse to accept that their gear is the issue. And these customers are the ones whose "technicians" are dinged for time spent on trouble tickets that are not out to us. We test and troubleshoot, tell them we've cleared the circuit and that this and that needs to be done, then 5 minutes later our ticket opens back up with "Still see circuit down." Yes, and you were told to dispatch a vendor. For these customers, I'll close at the end of my shift if they aren't willing to check their own gear.
I hate SNOW! It's a whole process to dumb down the organisation and allow people, especially Outsourcers, to get away with murder.
Call Centre staff get paid to answer every call regardless of whether anything gets fixed, so they end up shovelling it up the stack rather than fixing it, because THAT'S the way to meet your SLAs. I haven't worked in one organisation where SNOW added any value. All it did was make the entire IT organisation a ticket-and-queue operation with absolutely no aim of fixing the actual problem.
I even remember a conversation with several Service Delivery Managers where, before they had finished talking about how to assign a ticket to someone and get them to do a job, I'd texted the guy, asked him to do the job, and he'd done it and told me he'd finished - while the SDMs were STILL arguing about how to assign it.
Pointless, Pointless, Pointless.
IT Architects should have a TOTAL view of what's going on. If the architect doesn't know about it and hasn't documented it, the software/cloud app shouldn't be there. Application Developers should be slapped again and again until they understand the concept of documentation.
Better to spend the £millions it costs to implement SNOW on more competent staff and fewer PowerPoint presentations (oh, sorry, "slide decks"), and have people who have a handle on the environment and are paid well enough that they won't leave after 3 months because they are treated so badly.
I refused to even log in to SNOW in my last job
Echo this. At my last job, real work happened in spite of service-no, not because of it.
Bad enough when the Central IT Bureaucracy[tm] brought in service-no as their ticketing system (essentially; no doubt they referred to it as "ITSM" or whatever); at least the thing still had an email gateway of sorts, so you could reply to service-no ticket messages and update the ticket that way. Because the web page was a bodge, and updating things with it was a chore even on a good day.
But when they started expanding the service-no pile into other areas, things got worse. One I remember was taking the inventory + rack elevation tool developed in-house - which generally wasn't as good as other available solutions (e.g. RackTables) but had become adequate over time - and replacing it with some new "module"(?) from a company which ServiceNow had presumably bought. It lacked existing features of the original tool from the start, support and response to issues were uneven, performance of the tool was slow, data/content updates lagged, searching for nodes within the inventory was cumbersome, etc.
The supposed idea behind it all was that the company would be able to "free up" (probably let go) the internal devs and ops people responsible for the original tool, bring in this "integrated" tool (read: bolted on the side, different URL, different servers, some in the cloud, some possibly(?) in-house) and somehow magically create efficiencies with the IT service-no installation. Nobody mentioned "saving costs", at least not publicly, probably because the thing was more spendy than the already-developed tool they threw out.
It was somewhat ironic: users bemoaning the loss of a previous tool which people complained about from time to time, after having a service-no implementation inflicted on them.
Outfits like ServiceNow are insidious. They get their first hooks in, like the camel's nose in the tent, and keep going by convincing IT management types that "if you just buy the next module too, miracles will happen!". Atlassian is basically the same way.