Monitoring is simple enough – green means everything's fine. But getting to that point can be a whole other ball game

Monitoring seems easy in principle. There is nothing particularly complex about the software or the protocols it uses to interrogate systems and deliver alerts, nor about deciding what to monitor or setting up your chosen product. Yet although it's common to find monitoring done pretty well, it's very rare to find it …

  1. Anonymous Coward
    Anonymous Coward

    It's my job

    Single source of truth

    Doesn't matter if the source of truth is the monitoring system or some other system - but if it's anything else then the monitoring system must pull the truth regularly.

    1. Anonymous Coward
      Anonymous Coward

      Re: It's my job

      I think the point here is that in a company where there is a large mix of technologies there isn’t a one size fits all product. So it’s getting the best of breed for the myriad of technologies and then producing a single concise picture made up of the output from different monitoring tools.

      Alas, around here we have a "one size doesn't really fit anything"….

  2. FeepingCreature Bronze badge
    Go

    Green means

    Green means the probes have been broken for half a year of releases and reduce to "return True".

    Never trust a monitoring readout that you haven't seen report an error.
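
    A minimal sketch of that failure mode (hypothetical service, invented names): the "probe" everyone trusts has quietly degenerated into return True, while a real probe actually exercises the thing it claims to check.

      import socket

      def check_service_degenerate(host: str, port: int) -> bool:
          # Stubbed out during some long-forgotten incident and never reverted:
          # it has shown green for six months regardless of reality.
          return True

      def check_service_real(host: str, port: int, timeout: float = 3.0) -> bool:
          # Actually opens a TCP connection to the service - a probe you have
          # seen go red at least once when the service was genuinely down.
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  return True
          except OSError:
              return False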

    1. Joe W Silver badge

      Re: Green means

      Well.... yes, for some. We have this "clash of cultures" in our company: one group believes (in line with the official definition of what they are supposed to do and report) that "no valid measurement" means everything is alright (in dubio pro reo, etc.). My group interprets this as "the monitoring is not working and we are concerned", but not as a failure of the monitored system. Depending on who we are reporting to, it is treated in different ways. Internally I treat it as critical.

  3. The First Dave

    Sorry, but what is the author trying to sell?

    1. DJV Silver badge
      Mushroom

      Enjoy the downvotes

      Dave Cartwright was one of the administrators (sorry, I can't remember his exact job title) handling the rather extensive IT systems at the University of East Anglia in the UK when I was there as a mature student back in the mid-1990s. Given that he had to deal with a vast array of kit - Windows (3.1 at the time), a very early Linux lab, tons of 680x0 Macs and various Unixy type things elsewhere, along with all the infrastructure that held all that lot together - his experience and expertise are something that should be appreciated and absorbed, not insulted, which I suspect was your intention.

      1. This post has been deleted by its author

      2. The First Dave

        Re: Enjoy the downvotes

        I suppose a minor insult WAS intended - my point being that having read the article, I was left with nothing more than just "do everything perfectly", rather than any useful insight, which left me expecting a pointer to some magic product.

        1. JerseyDaveC

          Re: Enjoy the downvotes

          To be honest I'd love a pointer to some magic product: anyone who can sell me this mythical beast will be welcomed with open arms. And as DJV notes, I can assure you that the original author (me) has nothing to sell in this respect - formerly an enterprise networks and database person, I'm now a security guy in a financial business, not a purveyor of fine technology.

          I kinda see your point that you've read it as "do everything perfectly", but what I was trying to get at is that utopia is "green means everything is working as it should" but also that it's darned hard to get to a level of monitoring that gives you that, and even harder to stay there in the light of all the inevitable change within one's infrastructure. I've seen it done once, and it was really hard work staying good, but heck, it was worth it. Problem is that one false move mucks it all up.

  4. Potemkine! Silver badge

    Green means everything's fine

    Use of color affects the accessibility of your software to the widest possible audience. Users with blindness or low vision may not be able to see the colors well, if at all. Approximately 8 percent of adult males have some form of color confusion (often incorrectly referred to as "color blindness"), of which red-green color confusion is the most common.

    source: https://docs.microsoft.com/en-us/windows/win32/uxguide/vis-color

    GUI 101: Never rely on color alone to display information.
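
    As a hypothetical sketch of that rule, pair every colour with a symbol and a text label so the state still reads correctly with the colour taken away:

      # Hypothetical status rendering: colour is a hint, never the only signal.
      STATUS_STYLE = {
          "ok":       {"colour": "green", "symbol": "✔", "label": "OK"},
          "warning":  {"colour": "amber", "symbol": "!", "label": "WARNING"},
          "critical": {"colour": "red",   "symbol": "✖", "label": "CRITICAL"},
      }

      def render_status(name: str, state: str) -> str:
          style = STATUS_STYLE[state]
          # Even on a monochrome display, or to a reader with a colour vision
          # deficiency, the symbol and label still carry the state.
          return f"[{style['symbol']}] {name}: {style['label']} ({style['colour']})"

      print(render_status("web-frontend", "critical"))
      # [✖] web-frontend: CRITICAL (red)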

    1. Giles C Silver badge

      Re: Green means everything's fine

      As someone who is colourblind, green and amber are the same to me. Red and green are also similar.

      But then I see grass as orange. The older incandescent bulb traffic lights were orange, red and white - with the newer LED ones I do now see a colour in the green.

      And Ethernet cables are orange, orange, red, blue.

      So any monitoring system that goes amber or red is useless to me.

      Put a big cross on it or a tick (which can be coloured), not the stupid circles.

      1. Anonymous Coward
        Anonymous Coward

        Re: Green means everything's fine

        An audible warning is good too. Best monitoring setup I've seen had screens at shoulder height covering the width of the IT department and speakers that would softly ping for "flapping" states and go full aawwoOOGAA klaxon for failed states.

      2. Roger Kynaston

        Re: Green means everything's fine

        green ticks = high levels of verdancy

        yellow/amber circles = buttercups in the field

        red crosses = squashed rabbits.

      3. JerseyDaveC

        Re: Green means everything's fine

        Colour vision is something people forget, unless they have colour vision deficiencies (CVDs) themselves! You're absolutely right that indicators need more than just colours - shapes are handy, as are completely separate sections (if you have a box on the screen labelled "BAD STUFF" and put the alerts in there, it can help).

        I'm fortunate in having a fairly mild red-green CVD that doesn't really affect me day-to-day, but given that some sources say it's reckoned to affect up to 8 per cent of men of northern European origin, that's a lot of IT guys who have trouble differentiating "good" from "bad" on a monitoring screen, particularly when the blobs/lines are close together. (I can point at something that's red, and I can point at something that's green, but give me an Ishihara plate with them all mixed together and I'm doomed.)

        If you're about to scream "discrimination", red-green CVDs are a male trait - only half a per cent or so of women of similar origin have a red-green CVD. And you may also be interested to know that green traffic lights aren't green: they're greeny-blue, because people who can't tell green from red often can tell the difference between greeny-blue and red thanks to the blue component.

    2. ButlerInstitute

      Re: Green means everything's fine

      Our alarms and monitoring system (for broadcasters) sets the colours of indicators based on a stylesheet. (Alarms are shown on screen, written to database, optionally sounders in control rooms - always as defined by local admins)

      Usually Alarm is red with white text, Ok is green / white. We also have Acknowledged (orange / black), Latched (khaki / white), Ignored (grey / black), Forced On (yellow / red), Forced Off (yellow / green).

      But as a stylesheet they can be changed. Also, there's usually text "alarm" or "ok".

      I think broadcasting is (or was until quite recently) an industry where operators could be required not to be "colour-blind". It's hard to have someone in a control room who can't look at outputs and be sure they are correct (correct image and colour).

      There are different noises depending on assigned severity. Some things need action immediately (the output's gone black, or silent, or the video has frozen) so have noises giving appropriate sense of urgency.

      1. Giles C Silver badge

        Re: Green means everything's fine

        How not to design a system for colour blind users:

        "Usually Alarm is red with white text, Ok is green / white."

        Both of those, depending on shade, could be the same to me.

        "We also have Acknowledged (orange / black), Latched (khaki / white), Ignored (grey / black)"

        I could probably tell those apart.

        "Forced On (yellow / red), Forced Off (yellow / green)."

        Again, they would be very hard to tell apart.

        Being serious, before you employ anyone on your service desk, make them undertake a colour vision test.

        1. ButlerInstitute

          Re: Green means everything's fine

          Like I said, they are in a stylesheet so they can be changed. So it's ultimately the users' choice.

          It may be black text on red for alarms, so I would presume there you could tell the difference between black and white text.

          And the users are in a position where they can avoid having anyone without normal colour perception in positions where it is relevant.

  5. Mike 137 Silver badge

    Working within tolerance?

    "The basic thing you want from your monitoring is to show that, when everything is working within tolerance, all indicators are green"

    As long as the dashboard shows green right across, it's generally assumed that this represents normality. But it doesn't. It represents someone's assumptions about normality that got locked into the machine at some point in time. Thereafter the machine is trusted implicitly. For example, when Equifax let its scanner decryption certificate expire, all lights remained green because all data (including the malicious attack data) passed through unchecked. They're not alone. In my several decades of consulting, I've practically never met an organisation that was aware of what normality actually looks like.

    In order to monitor successfully, you have to monitor the changing operation of your systems, the changing external environment and the changing internal environment - all the time. So you're mostly monitoring change, but to do that you have to know what the starting point is. The big problem, obviously, is that the starting point today may not be the same as yesterday's. So the normality you need to find is not a set of static values for monitored points, but a typical rate of change for each point, so that you can identify the rate at which change is happening and compare it with the "normal" rate of change. The "tolerance" will be (for example) an acceptable variation in the rate of change at each monitored point that reflects the cycle of normal business activities round the clock.
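
    A minimal sketch of that idea, with invented window sizes and tolerances: track each monitored point's rate of change over a sliding window and compare it against the rate that point normally exhibits, rather than against a fixed threshold.

      from collections import deque

      class RateOfChangeMonitor:
          """Flag abnormal rates of change, not abnormal absolute values."""

          def __init__(self, window: int = 60, tolerance: float = 3.0):
              self.samples = deque(maxlen=window)        # recent (timestamp, value) pairs
              self.baseline_rates = deque(maxlen=1000)   # history of observed "normal" rates
              self.tolerance = tolerance                 # allowed multiple of the typical rate

          def observe(self, timestamp: float, value: float) -> bool:
              """Record a sample; return True if the rate of change looks abnormal."""
              self.samples.append((timestamp, value))
              if len(self.samples) < 2:
                  return False
              (t0, v0), (t1, v1) = self.samples[0], self.samples[-1]
              rate = abs(v1 - v0) / max(t1 - t0, 1e-9)
              if len(self.baseline_rates) < 100:
                  self.baseline_rates.append(rate)       # still learning what normal change looks like
                  return False
              typical = sum(self.baseline_rates) / len(self.baseline_rates)
              abnormal = rate > self.tolerance * max(typical, 1e-9)
              if not abnormal:
                  self.baseline_rates.append(rate)       # let the baseline follow gradual drift
              return abnormal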

  6. Jay 2

    At my place it's not necessarily the monitoring that's the problem... but the alerting.

    We've currently got an ageing Zenoss setup that we're going to be migrating to Prometheus. So the other day there was a memory-based issue on a JVM, the trigger for utilisation was passed and Zenoss sent a few emails out. But no-one in the team it goes to took any notice (for whatever reason). As a result the JVM wasn't very happy.

    The leader of that team then said the problem was that the alert was an email, and they get too many of them etc... So it was really important we get that sort of thing switched over to Prometheus so they could be alerted via a chatbot. Now he'd obviously forgotten, but not so long ago they were receiving disk space alerts for an application via Prometheus/chatbot, and you can probably guess what happened. Yes, that's right, there was too much noise from some other servers, the really important disk space alert was missed and the application ground to a halt.

    Just moving to a newer/sexier monitoring/alerting platform won't always solve your problems. On the other hand it will solve some of mine, as I look after the current Zenoss setup but the newer Prometheus/ELK stuff is run by another team. So the day I'm (mainly) no longer on the hook for any application monitoring will be a happy day indeed! I'll still have to worry about system monitoring, but that's far easier and less grief-laden.

    1. Keith Langmead

      "The leader of that team then said the problem was as the alert was an email, they get too many of them etc"

      IMHO that's one of the most critical things with any monitoring setup I've worked with: getting the dependencies set up correctly to reflect reality, and preventing those receiving the alerts from being flooded with them.

      E.g. if you have:

      Firewall > Switch > HV Server > Virtual Server > various services on the server

      all being monitored, then if something breaks and stops working, everything to the right of it will also be down, and you only want to be alerted to the most critical item that's down on the left. Otherwise, for instance, while you're scanning through the alerts for services on a VM that are showing as down, you can easily miss that the firewall in front of it all has stopped responding. Off the back of that you need to have an understanding of how the infrastructure fits together, which services rely on each other, etc.

      At a basic level monitoring's simple, but once you start digging into it it can become a minefield.
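
      A minimal sketch of that kind of suppression, assuming a hypothetical dependency map shaped like the chain above: walk up each down item's chain of parents and alert only on the items whose own parent is still up.

        # Hypothetical dependency map: child -> the parent it depends on.
        DEPENDS_ON = {
            "switch":      "firewall",
            "hv-server":   "switch",
            "vm":          "hv-server",
            "web-service": "vm",
            "db-service":  "vm",
        }

        def root_causes(down: set[str]) -> set[str]:
            """Return only the down items whose parents are still up -
            the left-most failures actually worth alerting on."""
            return {item for item in down
                    if DEPENDS_ON.get(item) not in down}

        # The firewall dies: everything behind it shows as down,
        # but only the firewall should generate an alert.
        print(root_causes({"firewall", "switch", "hv-server", "vm",
                           "web-service", "db-service"}))
        # -> {'firewall'}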

      1. Giles C Silver badge

        Dependencies are very hard to get right, and SolarWinds makes a complete screwup of it.

        I remember one package that gave you a list and then you could just drag items into a hierarchy and it controlled the cascade of alerts that way.

        I.e. if the links to a site go down then you know the site is dead until the links are back up, so don't tell me all 300 items are down as well.

        1. J. Cook Silver badge

          GODS YES.

          It's a case of "Yes, the links to site (x) are down, don't spam us with alerts that everything at that site is down", but unless it's programmed in like that (and by default, it is NOT!), it can quickly lead to alert fatigue.

          The former boss I named El Turkey wanted us to get an alert on Every. Single. Switch. Port. in the event someone rebooted their machine, or something went down. Our admin for SolarWinds said no, but that if he insisted, he'd be the one to get all the alerts.

          You ever hear a phone just sit and do nothing but spam out 'text message received' alerts for ten minutes solid? It's not fun, and I've sat through that hell exactly once, when we had a site go down hard, because of the sheer amount of crap that we have monitored there legitimately. But if there was an alert for each of the several hundred (or thousand plus, I think) switch ports? The phones would still be beeping away at us even though the incident happened several years ago...

        2. JerseyDaveC

          Yep, you're not wrong there. Dependencies are a pain in the backside and it's hard to get a complete dependency map.

          You might like to have a look at moogsoft.com in this context, which I say not from a commercial viewpoint (I know Phil and Mike from my journo days back in the 1990s - they were with Micromuse making NetCool when I first knew them) but because their AIOps stuff is quite clever. It's all about dependencies, but it does it in a rather novel way: by looking at log data from different sources and reasoning that if your app, your database, your server and your SAN are all throwing errors at the same time, they're probably related.

      2. Dvon of Edzore

        Alert communications also need to help the recipient understand the impact on the business of what's being alerted. Simplifying the alert to say "Firewall A3 is down" tells a typical C-level person nothing useful. Including "The following key systems are affected: Payment card processing halted, electronic funds transfer halted, electronic deposit processing halted" will encourage approval of better equipment or secondary services to avoid a repeat incident.

        1. ButlerInstitute

          Most broadcasters don't have "C-level" people in control rooms.

          Anyone seeing an alarm should understand its implications. Or there will be someone else on duty to refer to.

      3. ButlerInstitute

        We have a masking system where the output of one alarm can mask another to appear off.

        That's used for this case, where the data points to the left will mask those to the right so they don't generate alarms.

      4. SImon Hobson Bronze badge

        "where if something breaks and stops working then everything to the right of it will also be down, you only want to be alerted to the most critical item that's down on the left"

        Nagios does that out of the box, provided you tell it which items depend on which other items. It's far from perfect, since it assumes that if (say) there's two links between ${somethings} then "either link up" == "${somethings can talk to each other}". But it does exactly what's described - it will send an alert for the parent item, but only flag the child items on the console.
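
        For anyone who hasn't seen it, this is roughly what that looks like in a Nagios object configuration - the parents directive is what turns a dead upstream device into UNREACHABLE children rather than a flood of DOWN alerts (host names here are invented):

          define host {
              use        generic-host
              host_name  vm-legal-01
              alias      Legal department VM
              address    10.0.2.15
              # If hv-host-01 is DOWN, this host is reported as UNREACHABLE
              # instead of firing its own DOWN alert.
              parents    hv-host-01
          }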

        At one time (at my last job) I had monitoring for each individual site hosted on our web servers, after a SNAFU led to a situation with lots of sites being served up by a default config which happened to be one client's site. If the site itself was bad (a fetch of a page didn't return a particular string) then that would get flagged up, but if the server went down, then while sites would slowly turn red on the console, they'd not result in emails being sent.

        The biggest problem I had was a) no manglement support to monitor anything, b) no budget, and c) no support from colleagues, so I literally had to detect changes by seeing when I didn't have all-green and working out what had been changed. But the hell-desk did actually like it as a quick reference for when the phones started ringing.

  7. no user left unlocked

    It's a partial description of your IT farm.

    The article is pretty accurate, but I don't agree that missing a device automatically invalidates your monitoring: most everything else is still valid, but things that might touch or be touched by the rogue entity are potentially compromised. Always shades of grey.

    What matters there is that when something is added to the farm you have a way of seeing it - whether it is IP discovery scans, DNS additions, new MAC addresses appearing in DHCP or whatever - so you can track it down and absorb it.
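
    A minimal sketch of that safety net, with invented sources and file names: diff whatever discovery feed you have (here an nmap ping sweep, but DHCP leases or a DNS zone work just as well) against what the monitoring system already knows about, and flag the strays.

      import subprocess

      def discovered_hosts(subnet: str = "192.168.1.0/24") -> set[str]:
          """Ping-sweep a subnet with nmap and return the IPs that responded."""
          out = subprocess.run(
              ["nmap", "-sn", "-oG", "-", subnet],
              capture_output=True, text=True, check=True,
          ).stdout
          return {line.split()[1] for line in out.splitlines()
                  if line.startswith("Host:") and "Status: Up" in line}

      def monitored_hosts(inventory_file: str = "monitored_hosts.txt") -> set[str]:
          # Hypothetical export of the monitoring system's inventory, one address per line.
          with open(inventory_file) as fh:
              return {line.strip() for line in fh if line.strip()}

      for ip in sorted(discovered_hosts() - monitored_hosts()):
          print(f"Not monitored yet: {ip} - track it down and absorb it")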

    When adding a metric I've already gone past the "is it needed" phase and am looking at how it is to do its job: is it a binary check, simple warning/critical thresholds, or something more contextual? All to try to avoid false positives.

    Once it's in, like every other alert it gets tested manually every few months to make sure it still works; but if an alert is working and generating alerts which are not getting resolved, then the alert itself is challenged as not needed or incorrectly bounded. Where possible, alerts should always be an exception.

    I usually joke that if I'm doing my job well enough then I've just done myself out of a job because everything important is now covered and there is nothing really for me to add.

    1. Steve Aubrey
      Thumb Up

      Re: It's a partial description of your IT farm.

      "Miss one system and your monitoring infrastructure is immediately invalid."

      Agree with Unlocked - that isn't invalid, though it may be imperfect.

      1. J. Cook Silver badge

        Re: It's a partial description of your IT farm.

        There's also the case of "do we really need to be alerted if the test environment goes down?", which is kinda silly, but some people are insistent on that stuff as well...

        1. Fred Daggy Silver badge

          Re: It's a partial description of your IT farm.

          That sounds to me like someone's running production in the test environment.

          Mostly so they can bypass the various approvals, tests, signoffs, go live. Or even purchase of equipment.

          On the other hand, development (not test) might need to be monitored. I am thinking about highly paid (and not so highly paid) developers needing development systems to be up.

    2. Dvon of Edzore

      Re: It's a partial description of your IT farm.

      "I usually joke that if I'm doing my job well enough then I've just done myself out of a job because everything important is now covered and there is nothing really for me to add."

      What you add is your current experience. You better understand the relationship of the components to the business, and can modify the monitoring and reporting to be more useful for your specific business case.

      You can also work on cross-department relationships so changes can be anticipated and the infrastructure ready to meet future company goals as they arrive. For example, adding a first-ever third-party sales representative in a different country can bring a nightmare of compliance issues that did not exist before. You'll really appreciate time to get understanding and documentation of requirements and costs when you're blindsided in a meeting and have to either commit to doing something you have no idea how to accomplish or be seen as obstructionist to company progress.

    3. Anonymous Coward
      Anonymous Coward

      Re: It's a partial description of your IT farm.

      Yup, the invalid statement was itself invalid. Sounds like a management absolute diktat rather than an experienced IT worker reality.

      I would add that you can also have too much monitoring - again, often the suggestion of managers who don't have to actually deal with alert snowstorms. Monitoring needs to alert you to the things that matter, not just things that are monitored "because you can".

  8. Anonymous Coward
    Anonymous Coward

    "The story your monitoring console gives you must be unequivocal, and the moment you succumb to someone begging for an exception is the moment it all goes south."

    Alternatively, the exception is cast out of the monitoring umbrella. If your extra-sensitive legal server fills a hard drive, you'll find out when the lawyers start complaining. If your whizzbang testbed server craps itself, you'll find out when you've lost a bunch of time on a test run. You get to deal with the consequences.

    That approach won't work in all companies, though.

  9. ButlerInstitute

    Other states in alarms and monitoring

    As well as "it's ok right now" and "it's reporting a fault right now" we have :

    "it reported a fault recently but it's OK right now - you need to acknowledge that"

    "the device's reporting is noisy so we don't treat it as a real fault unless it reports for more than a certain time" (and other time processing)

    "Yes it's a real fault, but we've booked an engineer for tomorrow morning so we want to ignore it until after that time"

    This is for broadcasting, so some faults are "it's gone black live on air, so do something NOW"
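
    The "noisy device" state above is essentially time processing; a minimal sketch of the idea, with a made-up hold-off period: a raw fault only becomes a real alarm once it has been continuously present for longer than the hold-off.

      import time
      from typing import Optional

      class DebouncedAlarm:
          """Suppress a noisy input: only alarm once the fault has persisted."""

          def __init__(self, hold_off_seconds: float = 30.0):
              self.hold_off = hold_off_seconds
              self.fault_since: Optional[float] = None   # start of the current fault episode

          def update(self, fault_present: bool, now: Optional[float] = None) -> bool:
              """Feed in the raw state; returns True only for a 'real' alarm."""
              now = time.monotonic() if now is None else now
              if not fault_present:
                  self.fault_since = None                # fault cleared: reset the timer
                  return False
              if self.fault_since is None:
                  self.fault_since = now                 # fault just appeared: start timing
              return (now - self.fault_since) >= self.hold_off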

  10. PenGun

    LOL. I used to build these when I was bored. They are not rocket science.

    1. ButlerInstitute

      Rocket science is easy - it's the rocket engineering that's the difficult bit.....
