Monitoring is simple enough – green means everything's fine. But getting to that point can be a whole other ball game

Monitoring seems easy in principle. There is nothing particularly complex about the software or the protocols it uses to interrogate systems and deliver alerts, nor about deciding what to monitor or setting up your chosen product. Yet although it's common to find monitoring done pretty well, it's very rare to find it …

  1. Anonymous Coward
    Anonymous Coward

    It's my job

    Single source of truth

    Doesn't matter if the source of truth is the monitoring system or some other system - but if it's anything else then the monitoring system must pull the truth regularly.

    1. Anonymous Coward
      Anonymous Coward

      Re: It's my job

      I think the point here is that in a company where there is a large mix of technologies there isn’t a one size fits all product. So it’s getting the best of breed for the myriad of technologies and then producing a single concise picture made up of the output from different monitoring tools.

      Alas, around here we have a "one size doesn't really fit anything"….

  2. FeepingCreature Bronze badge
    Go

    Green means

    Green means the probes have been broken for half a year of releases and reduce to "return True".

    Never trust a monitoring readout that you haven't seen report an error.
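
    A minimal sketch of that failure mode (hypothetical service, invented names): the "probe" everyone trusts has quietly degenerated into return True, while a real probe actually exercises the thing it claims to check.

      import socket

      def check_service_degenerate(host: str, port: int) -> bool:
          # Stubbed out during some long-forgotten incident and never reverted:
          # it has shown green for six months regardless of reality.
          return True

      def check_service_real(host: str, port: int, timeout: float = 3.0) -> bool:
          # Actually opens a TCP connection to the service - a probe you have
          # seen go red at least once when the service was genuinely down.
          try:
              with socket.create_connection((host, port), timeout=timeout):
                  return True
          except OSError:
              return False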

    1. Joe W Silver badge

      Re: Green means

      Well.... yes, for some. We have this "clash of cultures" in our company: one group believes (in line with the official definition of what they are supposed to do and report) that "no valid measurement" means everything is alright (in dubio pro reo, etc.). My group interprets this as "the monitoring is not working and we are concerned", but not as a failure of the monitored system. Depending on who we are reporting to, it is treated in different ways. Internally I treat it as critical.

  3. The First Dave

    Sorry, but what is the author trying to sell?

    1. DJV Silver badge
      Mushroom

      Enjoy the downvotes

      Dave Cartwright was one of the administrators (sorry, I can't remember his exact job title) handling the rather extensive IT systems at the University of East Anglia in the UK when I was there as a mature student back in the mid-1990s. Given that he had to deal with a vast array of kit - Windows (3.1 at the time), a very early Linux lab, tons of 680x0 Macs and various Unixy type things elsewhere, along with all the infrastructure that held all that lot together - his experience and expertise are something that should be appreciated and absorbed, not insulted, which I suspect was your intention.

      1. This post has been deleted by its author

      2. The First Dave

        Re: Enjoy the downvotes

        I suppose a minor insult WAS intended - my point being that having read the article, I was left with nothing more than just "do everything perfectly", rather than any useful insight, which left me expecting a pointer to some magic product.

        1. JerseyDaveC

          Re: Enjoy the downvotes

          To be honest I'd love a pointer to some magic product: anyone who can sell me this mythical beast will be welcomed with open arms. And as DJV notes, I can assure you that the original author (me) has nothing to sell in this respect - formerly an enterprise networks and database person, I'm now a security guy in a financial business, not a purveyor of fine technology.

          I kinda see your point that you've read it as "do everything perfectly", but what I was trying to get at is that utopia is "green means everything is working as it should" but also that it's darned hard to get to a level of monitoring that gives you that, and even harder to stay there in the light of all the inevitable change within one's infrastructure. I've seen it done once, and it was really hard work staying good, but heck, it was worth it. Problem is that one false move mucks it all up.

  4. Potemkine! Silver badge

    Green means everything's fine

    Use of color affects the accessibility of your software to the widest possible audience. Users with blindness or low vision may not be able to see the colors well, if at all. Approximately 8 percent of adult males have some form of color confusion (often incorrectly referred to as "color blindness"), of which red-green color confusion is the most common.

    source: https://docs.microsoft.com/en-us/windows/win32/uxguide/vis-color

    GUI 101: Never rely on color alone to display information.
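
    As a hypothetical sketch of that rule, pair every colour with a symbol and a text label so the state still reads correctly with the colour taken away:

      # Hypothetical status rendering: colour is a hint, never the only signal.
      STATUS_STYLE = {
          "ok":       {"colour": "green", "symbol": "✔", "label": "OK"},
          "warning":  {"colour": "amber", "symbol": "!", "label": "WARNING"},
          "critical": {"colour": "red",   "symbol": "✖", "label": "CRITICAL"},
      }

      def render_status(name: str, state: str) -> str:
          style = STATUS_STYLE[state]
          # Even on a monochrome display, or to a reader with a colour vision
          # deficiency, the symbol and label still carry the state.
          return f"[{style['symbol']}] {name}: {style['label']} ({style['colour']})"

      print(render_status("web-frontend", "critical"))
      # [✖] web-frontend: CRITICAL (red)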

    1. Giles C Silver badge

      Re: Green means everything's fine

      As someone who is colourblind, green and amber are the same to me. Red and green are also similar.

      But then I see grass as orange. The older incandescent bulb traffic lights were orange, red and white - with the newer LED ones I do now see a colour in the green.

      And Ethernet cables are orange, orange, red, blue.

      So any monitoring system that goes amber or red is useless to me.

      Put a big cross on it or a tick (which can be coloured), not the stupid circles.

      1. Anonymous Coward
        Anonymous Coward

        Re: Green means everything's fine

        An audible warning is good too. Best monitoring setup I've seen had screens at shoulder height covering the width of the IT department and speakers that would softly ping for "flapping" states and go full aawwoOOGAA klaxon for failed states.

      2. Roger Kynaston

        Re: Green means everything's fine

        green ticks = high levels of verdancy

        yellow/amber circles = buttercups in the field

        red crosses = squashed rabbits.

      3. JerseyDaveC

        Re: Green means everything's fine

        Colour vision is something people forget, unless they have colour vision deficiencies (CVDs) themselves! You're absolutely right that indicators need more than just colours - shapes are handy, as are completely separate sections (if you have a box on the screen labelled "BAD STUFF" and put the alerts in there, it can help).

        I'm fortunate in having a fairly mild red-green CVD that doesn't really affect me day-to-day, but given that some sources say it's reckoned to affect up to 8 per cent of men of northern European origin, that's a lot of IT guys who have trouble differentiating "good" from "bad" on a monitoring screen, particularly when the blobs/lines are close together. (I can point at something that's red, and I can point at something that's green, but give me an Ishihara plate with them all mixed together and I'm doomed.)

        If you're about to scream "discrimination", red-green CVDs are a male trait - only half a per cent or so of women of similar origin have a red-green CVD. And you may also be interested to know that green traffic lights aren't green: they're greeny-blue, because people who can't tell green from red often can tell the difference between greeny-blue and red thanks to the blue component.

    2. ButlerInstitute

      Re: Green means everything's fine

      Our alarms and monitoring system (for broadcasters) sets the colours of indicators based on a stylesheet. (Alarms are shown on screen, written to database, optionally sounders in control rooms - always as defined by local admins)

      Usually Alarm is red with white text, Ok is green / white. We also have Acknowledged (orange / black), Latched (khaki / white), Ignored (grey / black), Forced On (yellow / red), Forced Off (yellow / green).

      But as a stylesheet they can be changed. Also, there's usually text "alarm" or "ok".

      I think broadcasting is (or was until quite recently) an industry where operators could be required not to be "colour-blind". It's hard to have someone in a control room who can't look at outputs and be sure they are correct (correct image and colour).

      There are different noises depending on assigned severity. Some things need action immediately (the output's gone black, or silent, or the video has frozen) so have noises giving appropriate sense of urgency.

      1. Giles C Silver badge

        Re: Green means everything's fine

        How not to design a system for colour blind users:

        "Usually Alarm is red with white text, Ok is green / white."

        Both of those, depending on shade, could be the same to me.

        "We also have Acknowledged (orange / black), Latched (khaki / white), Ignored (grey / black)"

        I could probably tell those apart.

        "Forced On (yellow / red), Forced Off (yellow / green)."

        Again, they would be very hard to tell apart.

        Being serious, before you employ anyone on your service desk, make them undertake a colour vision test.

        1. ButlerInstitute

          Re: Green means everything's fine

          Like I said, they are in a stylesheet so they can be changed. So it's ultimately the users' choice.

          It may be black text on red for alarms, so I would presume there you could tell the difference between black and white text.

          And the users are in a position where they can avoid having anyone without normal colour perception in positions where it is relevant.

  5. Mike 137 Silver badge

    Working within tolerance?

    "The basic thing you want from your monitoring is to show that, when everything is working within tolerance, all indicators are green"

    As long as the dashboard shows green right across, it's generally assumed that this represents normality. But it doesn't. It represents someone's assumptions about normality that got locked into the machine at some point in time. Thereafter the machine is trusted implicitly. For example, when Equifax let its scanner decryption certificate expire, all lights remained green because all data (including the malicious attack data) passed through unchecked. They're not alone. In my several decades of consulting, I've practically never met an organisation that was aware of what normality actually looks like.

    In order to monitor successfully, you have to monitor the changing operation of your systems, the changing external environment and the changing internal environment - all the time. So you're mostly monitoring change, but to do that you have to know what the starting point is. The big problem, obviously, is that the starting point today may not be the same as yesterday's. So the normality you need to find is not a set of static values for monitored points, but a typical rate of change for each point, so that you can identify the rate at which change is happening and compare it with the "normal" rate of change. The "tolerance" will be (for example) an acceptable variation in the rate of change at each monitored point that reflects the cycle of normal business activities round the clock.
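
    A minimal sketch of that idea, with invented window sizes and tolerances: track each monitored point's rate of change over a sliding window and compare it against the rate that point normally exhibits, rather than against a fixed threshold.

      from collections import deque

      class RateOfChangeMonitor:
          """Flag abnormal rates of change, not abnormal absolute values."""

          def __init__(self, window: int = 60, tolerance: float = 3.0):
              self.samples = deque(maxlen=window)        # recent (timestamp, value) pairs
              self.baseline_rates = deque(maxlen=1000)   # history of observed "normal" rates
              self.tolerance = tolerance                 # allowed multiple of the typical rate

          def observe(self, timestamp: float, value: float) -> bool:
              """Record a sample; return True if the rate of change looks abnormal."""
              self.samples.append((timestamp, value))
              if len(self.samples) < 2:
                  return False
              (t0, v0), (t1, v1) = self.samples[0], self.samples[-1]
              rate = abs(v1 - v0) / max(t1 - t0, 1e-9)
              if len(self.baseline_rates) < 100:
                  self.baseline_rates.append(rate)       # still learning what normal change looks like
                  return False
              typical = sum(self.baseline_rates) / len(self.baseline_rates)
              abnormal = rate > self.tolerance * max(typical, 1e-9)
              if not abnormal:
                  self.baseline_rates.append(rate)       # let the baseline follow gradual drift
              return abnormal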

  6. Jay 2

    At my place it's not necessarily the monitoring that's the problem... but the alerting.

    We've currently got an ageing Zenoss setup that we're going to be migrating to Prometheus. So the other day there was a memory-based issue on a JVM, the trigger for utilisation was passed and Zenoss sent a few emails out. But no-one in the team it goes to took any notice (for whatever reason). As a result the JVM wasn't very happy.

    The leader of that team then said the problem was that the alert was an email, and they get too many of them etc... So it was really important we get that sort of thing switched over to Prometheus so they could be alerted via a chatbot. Now he'd obviously forgotten, but not so long ago they were receiving disk space alerts for an application via Prometheus/chatbot, and you can probably guess what happened. Yes, that's right, there was too much noise from some other servers, the really important disk space alert was missed and the application ground to a halt.

    Just moving to a newer/sexier monitoring/alerting platform won't always solve your problems. On the other hand it will solve some of mine, as I look after the current Zenoss setup but the newer Prometheus/ELK stuff is run by another team. So the day I'm (mainly) no longer on the hook for any application monitoring will be a happy day indeed! I'll still have to worry about system monitoring, but that's far easier and less grief-laden.

    1. Keith Langmead

      "The leader of that team then said the problem was as the alert was an email, they get too many of them etc"

      IMHO that's one of the most critical things with any monitoring setup I've worked with: getting the dependencies set up correctly to reflect reality, and preventing those receiving the alerts from being flooded with them.

      E.g. if you have:

      Firewall > Switch > HV Server > Virtual Server > various services on the server

      all being monitored, then if something breaks and stops working, everything to the right of it will also be down, and you only want to be alerted to the most critical item that's down on the left. Otherwise, for instance, while you're scanning through the alerts for services on a VM that are showing as down, you can easily miss that the firewall in front of it all has stopped responding. Off the back of that you need to have an understanding of how the infrastructure fits together, which services rely on each other, etc.

      At a basic level monitoring's simple, but once you start digging into it it can become a minefield.
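
      A minimal sketch of that kind of suppression, assuming a hypothetical dependency map shaped like the chain above: walk up each down item's chain of parents and alert only on the items whose own parent is still up.

        # Hypothetical dependency map: child -> the parent it depends on.
        DEPENDS_ON = {
            "switch":      "firewall",
            "hv-server":   "switch",
            "vm":          "hv-server",
            "web-service": "vm",
            "db-service":  "vm",
        }

        def root_causes(down: set[str]) -> set[str]:
            """Return only the down items whose parents are still up -
            the left-most failures actually worth alerting on."""
            return {item for item in down
                    if DEPENDS_ON.get(item) not in down}

        # The firewall dies: everything behind it shows as down,
        # but only the firewall should generate an alert.
        print(root_causes({"firewall", "switch", "hv-server", "vm",
                           "web-service", "db-service"}))
        # -> {'firewall'}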

      1. Giles C Silver badge

        Dependencies are very hard to get right, and SolarWinds makes a complete screwup of it.

        I remember one package that gave you a list and then you could just drag items into a hierarchy and it controlled the cascade of alerts that way.

        I.e. if the links to a site go down then you know the site is dead until the links are back up, so don't tell me all 300 items are down as well.

        1. J. Cook Silver badge

          GODS YES.

          It's a case of "Yes, the links to site (x) are down, don't spam us with alerts that everything at that site is down", but unless it's programmed in like that (and by default, it is NOT!), it can quickly lead to alert fatigue.

          The former boss I named El Turkey wanted us to get an alert on Every. Single. Switch. Port. in the event someone rebooted their machine, or something went down. Our admin for SolarWinds said no, but that if he insisted, he'd be the one to get all the alerts.

          You ever hear a phone just sit and do nothing but spam out 'text message received' alerts for ten minutes solid? It's not fun, and I've sat through that hell exactly once, when we had a site go down hard, because of the sheer amount of crap that we have monitored there legitimately. But if there was an alert for each of the several hundred (or thousand plus, I think) switch ports? The phones would still be beeping away at us even though the incident happened several years ago...

        2. JerseyDaveC

          Yep, you're not wrong there. Dependencies are a pain in the backside and it's hard to get a complete dependency map.

          You might like to have a look at moogsoft.com in this context, which I say not from a commercial viewpoint (I know Phil and Mike from my journo days back in the 1990s - they were with Micromuse making NetCool when I first knew them) but because their AIOps stuff is quite clever. It's all about dependencies, but it does it in a rather novel way: by looking at log data from different sources and reasoning that if your app, your database, your server and your SAN are all throwing errors at the same time, they're probably related.

      2. Dvon of Edzore

        Alert communications also need to help the recipient understand the impact on the business of what's being alerted. Simplifying the alert to say "Firewall A3 is down" tells a typical C-level person nothing useful. Including "The following key systems are affected: Payment card processing halted, electronic funds transfer halted, electronic deposit processing halted" will encourage approval of better equipment or secondary services to avoid a repeat incident.

        1. ButlerInstitute

          Most broadcasters don't have "C-level" people in control rooms.

          Anyone seeing an alarm should understand its implications. Or there will be someone else on duty to refer to.

      3. ButlerInstitute

        We have a masking system where the output of one alarm can mask another to appear off.

        That's used for this case, where the data points to the left will mask those to the right so they don't generate alarms.

      4. SImon Hobson Bronze badge

        "where if something breaks and stops working then everything to the right of it will also be down, you only want to be alerted to the most critical item that's down on the left"

        Nagios does that out of the box, provided you tell it which items depend on which other items. It's far from perfect, since it assumes that if (say) there's two links between ${somethings} then "either link up" == "${somethings can talk to each other}". But it does exactly what's described - it will send an alert for the parent item, but only flag the child items on the console.
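
        For anyone who hasn't seen it, this is roughly what that looks like in a Nagios object configuration - the parents directive is what turns a dead upstream device into UNREACHABLE children rather than a flood of DOWN alerts (host names here are invented):

          define host {
              use        generic-host
              host_name  vm-legal-01
              alias      Legal department VM
              address    10.0.2.15
              # If hv-host-01 is DOWN, this host is reported as UNREACHABLE
              # instead of firing its own DOWN alert.
              parents    hv-host-01
          }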

        At one time (at my last job) I had monitoring for each individual site hosted on our web servers, after a SNAFU led to a situation with lots of sites being served up by a default config which happened to be one client's site. If the site itself was bad (a fetch of a page didn't return a particular string) then that would get flagged up, but if the server went down, then while sites would slowly turn red on the console, they'd not result in emails being sent.

        The biggest problem I had was a) no manglement support to monitor anything, b) no budget, and c) no support from colleagues, so I literally had to detect changes by seeing when I didn't have all-green and working out what had been changed. But the hell-desk did actually like it as a quick reference for when the phones started ringing.

  7. no user left unlocked

    It's a partial description of your IT farm.

    The article is pretty accurate, but I don't agree that missing a device automatically invalidates your monitoring: most everything else is still valid, but things that might touch or be touched by the rogue entity are potentially compromised. Always shades of grey.

    What matters there is that when something is added to the farm you have a way of seeing it - whether it is IP discovery scans, DNS additions, new MAC addresses appearing in DHCP or whatever - so you can track it down and absorb it.
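
    A minimal sketch of that safety net, with invented sources and file names: diff whatever discovery feed you have (here an nmap ping sweep, but DHCP leases or a DNS zone work just as well) against what the monitoring system already knows about, and flag the strays.

      import subprocess

      def discovered_hosts(subnet: str = "192.168.1.0/24") -> set[str]:
          """Ping-sweep a subnet with nmap and return the IPs that responded."""
          out = subprocess.run(
              ["nmap", "-sn", "-oG", "-", subnet],
              capture_output=True, text=True, check=True,
          ).stdout
          return {line.split()[1] for line in out.splitlines()
                  if line.startswith("Host:") and "Status: Up" in line}

      def monitored_hosts(inventory_file: str = "monitored_hosts.txt") -> set[str]:
          # Hypothetical export of the monitoring system's inventory, one address per line.
          with open(inventory_file) as fh:
              return {line.strip() for line in fh if line.strip()}

      for ip in sorted(discovered_hosts() - monitored_hosts()):
          print(f"Not monitored yet: {ip} - track it down and absorb it")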

    When adding a metric I've already gone past the "is it needed" phase and am looking at how it is to do its job: is it a binary check, simple warning/critical thresholds, or something more contextual? All to try to avoid false positives.

    Once it's in, like every other alert it gets tested manually every few months to make sure it still works; but if an alert is working and generating alerts which are not getting resolved, then the alert itself is challenged as not needed or incorrectly bounded. Where possible, alerts should always be an exception.

    I usually joke that if I'm doing my job well enough then I've just done myself out of a job because everything important is now covered and there is nothing really for me to add.

    1. Steve Aubrey
      Thumb Up

      Re: It's a partial description of your IT farm.

      "Miss one system and your monitoring infrastructure is immediately invalid."

      Agree with Unlocked - that isn't invalid, though it may be imperfect.

      1. J. Cook Silver badge

        Re: It's a partial description of your IT farm.

        There's also the case of "do we really need to be alerted if the test environment goes down?", which is kinda silly, but some people are insistent on that stuff as well...

        1. Fred Daggy Silver badge

          Re: It's a partial description of your IT farm.

          That sounds to me like someone's running production in the test environment.

          Mostly so they can bypass the various approvals, tests, signoffs, go live. Or even purchase of equipment.

          On the other hand, development (not test) might need to be monitored. I am thinking about highly paid (and not so highly paid) developers needing development systems to be up.

    2. Dvon of Edzore

      Re: It's a partial description of your IT farm.

      "I usually joke that if I'm doing my job well enough then I've just done myself out of a job because everything important is now covered and there is nothing really for me to add."

      What you add is your current experience. You better understand the relationship of the components to the business, and can modify the monitoring and reporting to be more useful for your specific business case.

      You can also work on cross-department relationships so changes can be anticipated and the infrastructure ready to meet future company goals as they arrive. For example, adding a first-ever third-party sales representative in a different country can bring a nightmare of compliance issues that did not exist before. You'll really appreciate time to get understanding and documentation of requirements and costs when you're blindsided in a meeting and have to either commit to doing something you have no idea how to accomplish or be seen as obstructionist to company progress.

    3. Anonymous Coward
      Anonymous Coward

      Re: It's a partial description of your IT farm.

      Yup, the invalid statement was itself invalid. Sounds like a management absolute diktat rather than an experienced IT worker reality.

      I would add that you can also have too much monitoring - again, often the suggestion of managers who don't have to actually deal with alert snowstorms. Monitoring needs to alert you to the things that matter, not just things that are monitored "because you can".

  8. Anonymous Coward
    Anonymous Coward

    "The story your monitoring console gives you must be unequivocal, and the moment you succumb to someone begging for an exception is the moment it all goes south."

    Alternatively, the exception is cast out of the monitoring umbrella. If your extra-sensitive legal server fills a hard drive, you'll find out when the lawyers start complaining. If your whizzbang testbed server craps itself, you'll find out when you've lost a bunch of time on a test run. You get to deal with the consequences.

    That approach won't work in all companies, though.

  9. ButlerInstitute

    Other states in alarms and monitoring

    As well as "it's ok right now" and "it's reporting a fault right now" we have :

    "it reported a fault recently but it's OK right now - you need to acknowledge that"

    "the device's reporting is noisy so we don't treat it as a real fault unless it reports for more than a certain time" (and other time processing)

    "Yes it's a real fault, but we've booked an engineer for tomorrow morning so we want to ignore it until after that time"

    This is for broadcasting, so some faults are "it's gone black live on air, so do something NOW"
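
    The "noisy device" state above is essentially time processing; a minimal sketch of the idea, with a made-up hold-off period: a raw fault only becomes a real alarm once it has been continuously present for longer than the hold-off.

      import time
      from typing import Optional

      class DebouncedAlarm:
          """Suppress a noisy input: only alarm once the fault has persisted."""

          def __init__(self, hold_off_seconds: float = 30.0):
              self.hold_off = hold_off_seconds
              self.fault_since: Optional[float] = None   # start of the current fault episode

          def update(self, fault_present: bool, now: Optional[float] = None) -> bool:
              """Feed in the raw state; returns True only for a 'real' alarm."""
              now = time.monotonic() if now is None else now
              if not fault_present:
                  self.fault_since = None                # fault cleared: reset the timer
                  return False
              if self.fault_since is None:
                  self.fault_since = now                 # fault just appeared: start timing
              return (now - self.fault_since) >= self.hold_off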

  10. PenGun

    LOL. I used to build these when I was bored. They are not rocket science.

    1. ButlerInstitute

      Rocket science is easy - it's the rocket engineering that's the difficult bit.....
