Look, we did a survey that shows AIOps is ready for the primetime, says AIOps firm

Adoption of AIOps in IT departments is set to go mainstream, or so says a survey of medium and large enterprises which found 93 per cent of respondents are either already using the tech, or plan to adopt it in the near future. The survey was conducted by StarCIO on behalf of BigPanda, a company that develops an AIOps platform …

  1. Doctor Syntax Silver badge

    They found a few who didn't hit delete on a survey on a topic they weren't interested in?

  2. Blank Reg

    Well it's more believable than the surveys run by those owning lots of expensive and mostly empty office space that claim we all want to go back to the office

  3. Throatwarbler Mangrove Silver badge
    Thumb Down

    AIOps is crap

    I have yet to see any "AI", just a lot of specious notifications which require hand-tuning, the same as with any other monitoring solution.

    1. Nate Amsden

      Re: AIOps is crap

      I tend to agree; I have yet to see a solution that can actually do something meaningful. Several years ago my org invested in Splunk ITSI, which was/is a form of AIOps I suppose, but it never got off the ground. They ignored my advice not to waste time with that product (Splunk otherwise is a fine product), spent a bunch of $$$, and literally aborted the implementation process a few weeks in (highly custom app stack that I knew was going to take a LOT of time to make ITSI work, 10x more than they were expecting). Then they spent a bunch of new $$$ on New Relic, which is a good tool too, but that also requires a lot of tuning, especially for custom application stacks. The org spent months trying to tune out the false positives/negatives but never managed to before they aborted that as well. We still have good dashboards and get good data from both tools, so they are worth having, even if the level of automation for finding things isn't there (for us). Having the data in an easy-to-access manner makes it easier for a human to trace an issue.

      Then I was friendly with a cold call from PagerDuty (already a customer), who tried to sell me on their stuff. I talked to them for, I think, over an hour and just knew they wouldn't have a workable solution for AIOps.

      I haven't used HP Infosight with Nimble storage, but my experience with Infosight on 3PAR is, well, that the only thing Infosight is good for is telling me when a new version of software or some critical patch is out that I should apply. Things like predicting hardware failure have been standard for decades at this point; I don't see how something like Infosight changes that. They just moved where the logic was being executed.

      The performance/space monitoring etc. is pretty useless, mostly because I already have better tools for that (the better tool being LogicMonitor; they wrote their 3PAR integration based on my in-house-built Perl code, though theirs doesn't use Perl of course). HP claimed good integration with vCenter, but again the resolution of data was not useful: I collect data at 60-second intervals with LM, but with Infosight 5-10 minutes was the minimum at one point (years ago, when I talked with their engineering on it). The DBA I worked with didn't even like 60-second intervals; he wanted 10 seconds. I even talked with HPE engineering on deeper-level Infosight stuff, saying it would be nice if Infosight reported on events like SSD garbage collection (that might account for strange latency spikes on individual SSDs). That kind of stat is not exposed anywhere else, so it is an opportunity to differentiate, but at the time (~3-4+ years ago) they had no plans to do that.

      I've only worked in small orgs, I guess, never more than 500 employees (currently managing about 700 VMs + VMware hosts and network and storage), but I can pretty much always find the root cause of an issue within, say, 3-5 minutes. Sometimes not even that. Getting the issue fixed may certainly take longer. Infrastructure-wise it's generally super easy to find the cause of an issue with my setup in LogicMonitor, which is monitoring way more things than I've ever been able to prior to finding that product. I've spent a ton of time customizing it and adding support for more things over the years as well. I even did some work on the side for another company to bring their monitoring more up to speed. They had invested $500k+ in some big-name monitoring system which wasn't working properly, and the cost of LogicMonitor was less than the support renewal for their existing monitoring (and they had LM monitoring more things in 20 minutes than the expensive tool that had people working on it for weeks). But then they wanted someone like me to show them what more could be done, though it took them a couple of years to realize that. They could have done it themselves; they just probably didn't have the time. I've spent thousands of hours working on LogicMonitor, which is a lot, but it is a ~95%+ time savings over previous tools/methods I have used, many of which simply lack the integrations that make monitoring even remotely useful (one such example is monitoring SSL cert expiration on our load balancers, another is deep vCenter API integration).
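For readers wondering what the SSL cert expiration check mentioned above might look like, here is a minimal sketch using only the Python standard library. It is not LogicMonitor's implementation or the commenter's; the function names and the idea of polling a load balancer's VIP directly are assumptions for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days left before a certificate expires.

    not_after is the 'notAfter' string returned by SSLSocket.getpeercert(),
    e.g. "Jan  1 00:00:00 2030 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect to host:port and return the certificate's notAfter string.

    Uses default verification, so an already-expired or untrusted
    certificate raises an ssl.SSLError instead of returning a date.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A monitoring job could call `fetch_not_after()` for each virtual server and alert when `days_until_expiry()` drops below a threshold such as 30; the threshold is likewise an assumption.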

      I was interested in using the New Relic free account to monitor my home VMware infrastructure, but oh my god was the integration just absolutely totally worthless. Datadog tried to get me to replace LM with their solution and it was similar: Datadog was useless for what I needed. They did excel in other areas where LM fell flat (mainly containers, and Datadog's billing process for containers is excellent as well), so there was room for both tools in that case.

      Having full insight into vCenter, storage, network switches, load balancers, server VMs (and the infrastructure apps that run on them, because our in-house app stack exposes no stats!), firewalls (insight here is fairly limited to basic network stats) and being able to show data from all of them on a single dashboard (mainly graphs) is really handy. I even figured out a way last year to calculate vCPU/pCPU provisioning ratios from vCenter and display that as well (doing that before was a purely manual process for me, but I didn't do it often), and I have graphs showing CPU MHz for the VMs so I can see when a particular VM is consuming a lot of resources (because 80% of a 1-vCPU VM is obviously very different from 80% of an 8-vCPU VM) - these specific metrics are not available in the default LM setup; you have to know they exist via the vCenter API and modify the collection processes to get that data.
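To make the two metrics above concrete, here is a rough sketch of the arithmetic. It does not query vCenter: the `Host` and `VM` records, their field names, and the sample numbers are all hypothetical stand-ins for whatever inventory data a collector would pull.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    physical_cores: int
    core_mhz: int  # nominal clock speed per core


@dataclass
class VM:
    name: str
    host: str
    vcpus: int
    cpu_percent: float  # utilisation averaged across all of the VM's vCPUs


def vcpu_pcpu_ratio(host, vms):
    """vCPUs provisioned on a host divided by its physical cores."""
    provisioned = sum(vm.vcpus for vm in vms if vm.host == host.name)
    return provisioned / host.physical_cores


def cpu_mhz_used(vm, host):
    """Convert a percentage into MHz so VMs of different sizes compare fairly."""
    return vm.cpu_percent / 100 * vm.vcpus * host.core_mhz


# Hypothetical inventory: one 32-core host and three VMs.
host = Host("esx01", physical_cores=32, core_mhz=2600)
vms = [
    VM("small", "esx01", vcpus=1, cpu_percent=80.0),
    VM("big", "esx01", vcpus=8, cpu_percent=80.0),
    VM("db", "esx01", vcpus=55, cpu_percent=10.0),
]

print(f"{host.name}: {vcpu_pcpu_ratio(host, vms):.1f} vCPU per pCPU")
for vm in vms:
    print(f"{vm.name}: {cpu_mhz_used(vm, host):.0f} MHz")
```

Both the "small" and "big" VMs report 80%, but in MHz the 8-vCPU machine is consuming eight times as much, which is the distinction the graphs are meant to surface.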

      Now it takes longer to find the cause if it's for example buried in network packets or something, my tooling doesn't extend to that level. I had a firewall decide to just start dropping packets between a specific source and destination out of nowhere for example(I got an alert from LogicMonitor from my load balancer that uses that specific path fortunately). Packet trace on the FW showed the issue, but no idea the cause(other than bug), had to reboot both firewalls in the pair(one at a time) to get the issue fixed. Vendor had no idea other than "hey upgrade we haven't had this reported on the newer version but we also haven't fixed any specific issues related on the newer version", and "yes the packet trace shows the firewall is not behaving properly with this specific traffic flow".

      I think the bar for these AIOps folks is so low, only the most disorganized dysfunctional orgs that have zero monitoring for example can see real benefits because they are comparing against well, real terrible stuff or nothing at all.

      I remember ~8-9 years ago my switches had a bug with ARP requests: if I did a vMotion of a VM to another host, when the vMotion hit 60% the VM would drop off the network until I flushed the ARP table on the switch for that IP address (which I got done after I figured it out; I disabled DRS and just did manual vMotions when needed, with my finger on the ENTER key on my keyboard to flush the ARP entry when the vMotion hit 60%). It took a long time to track down: no errors in the logs, but at one point I saw the serial console of the switch was spewing low-level errors from the chipset (which weren't being stored by the switch OS). A software update fixed it. I haven't seen monitoring that could pick up on that kind of condition, seeing that, hey, the VM is moving from host 1 to host 5, so the MAC address should be moving from switch 1 port 23 to switch 3 port 37 (at the right instant), and making sure it moves properly. It's such an edge case that I don't think anyone would monitor that kind of thing. I remember another switch bug back in 2008 where having sFlow enabled on a port caused multicast traffic to fail on that port (we had very minimal use of multicast at the time, so it took a while to figure out) - but again I don't even know how a monitoring system could pick up on something like that.

      Problem with AIOps as a concept is it needs a lot of data, then needs to make sense of the data, there's so much variation out there that I think it's not yet possible to get significantly more value than you can otherwise get just with monitoring lots of data points and dashboards.

      I've been doing custom monitoring going back to the late 90s with MRTG. At one company in about 2004 I had written a ton of stuff using RRDTool + RRDcgi and scripts to parse the data and make graphs (before I started, most of their monitoring was simply tail -f on the log files; my manager could interpret Apache response time in microseconds and translate it to seconds in real time. I saw him doing that and determined my brain couldn't think that fast, so I wrote a Perl script to parse it, and that was the beginning). I was in a company-wide meeting when they tossed my graphs up on the screen. I was horrified. I told them again that these graphs, while useful, were still considered experimental, so don't blame me if there are data errors (and there were from time to time; lots of counters and Perl regexes). They said they understood and accepted the risk, and they were planning company strategy using my graphs since that was the best data they had available to them. Years after I left they were still using some of my graphs despite having invested heavily in a big-name 3rd-party monitoring service, which, it turns out, couldn't do some of what my graphs had been doing for years already.
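As an illustration of the log-parsing step described above, here is a small Python sketch (not the commenter's Perl script) that converts Apache response times from microseconds to seconds. It assumes a log format that ends with `%D`, Apache's request duration in microseconds; the regex, function names, and sample lines are made up for the example.

```python
import re
from statistics import mean

# Assumed format: LogFormat "%h %l %u %t \"%r\" %>s %b %D"
# i.e. the last field on each line is the duration in microseconds.
LINE_RE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ (?P<usec>\d+)$'
)


def response_seconds(line):
    """Return (path, seconds) for one log line, or None if it doesn't match."""
    m = LINE_RE.search(line.strip())
    if m is None:
        return None
    return m.group("path"), int(m.group("usec")) / 1_000_000


def mean_seconds(lines):
    """Average response time in seconds over all parseable lines."""
    values = [r[1] for r in map(response_seconds, lines) if r is not None]
    return mean(values) if values else None


# Made-up sample lines, including one that should be skipped.
sample = [
    '10.0.0.1 - - [10/Oct/2004:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 1534210',
    '10.0.0.2 - - [10/Oct/2004:13:55:37 -0700] "GET /index.html HTTP/1.1" 200 2326 250000',
    "garbage line",
]

for line in sample:
    print(response_seconds(line))
```

Skipping unparseable lines instead of raising is a deliberate choice here: as the comment notes, regex-based parsing of real logs produces occasional data errors, and a graphing pipeline usually prefers a gap to a crash.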

      1. Anonymous Coward

        Re: AIOps is crap

        Nice, but have we really been using MRTG* for 20+ years now?

        * - props to MRTG's authors for making SNMPv3 usable, and simple to configure, as a collection method

  4. Anonymous Coward

    "The survey was conducted by StarCIO on behalf of BigPanda, a company that develops an AIOps platform, "

    So, safe to say that BigPanda couldn't convince/bribe Gartner to create a Magic Quadrant for AIOps?

  5. Ace2 Silver badge

    “AIOps”?

    Stupid name

    1. Anonymous Coward

      Re: “AIOps”?

      I'd have to pronounce it "Yawps" :)

  6. Anonymous Coward

    * We feel beholden to mention that the individuals polled were all affiliated with our organization.
