..is good. Let's say ELK stack then it's a contender compared to graphite
The BBC and NHS epitomise enterprise: the BBC has 23,000 staff while the NHS is one of the world's largest employers, with 1.4 million. Their IT estate is vast and central to the delivery of their services. The BBC's iPlayer is on the front line in a world of on-demand TV defined by Netflix, and among its layered infrastructure …
Doesn't scale as well as it should for most people. You've got to have a pretty thorough understanding of the underlying indexing and sharding architectures to get it up into the tens/hundreds of thousands of events per second range. Yes it's relatively easy to do but it's fundamentally a square peg in a round hole. Log analytics shouldn't need search, it's a time-series aggregation and filtering problem, which is why Graphite ships with a time-series database. Means you get to spend more time on product and less time on plumbing.
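For what it's worth, the "time-series aggregation and filtering" point above can be sketched in a few lines. This is only an illustration in plain Python, not how Graphite or ELK implement anything; the function name `bucket_counts` and the `(timestamp, level)` event shape are made up for the example.

```python
from collections import Counter

def bucket_counts(events, bucket_seconds=60, level=None):
    """Count events per fixed-width time bucket, optionally filtered by level.

    `events` is an iterable of (unix_timestamp, level) pairs; this shape is
    an assumption for the sketch, real log records carry far more fields.
    """
    counts = Counter()
    for ts, lvl in events:
        # Filtering step: skip events that do not match the requested level.
        if level is not None and lvl != level:
            continue
        # Aggregation step: round the timestamp down to the start of its bucket.
        bucket_start = int(ts) - int(ts) % bucket_seconds
        counts[bucket_start] += 1
    return dict(sorted(counts.items()))

# Hypothetical sample data: five events spread over three minutes.
events = [(0, "INFO"), (30, "ERROR"), (61, "ERROR"), (119, "ERROR"), (120, "INFO")]
errors_per_minute = bucket_counts(events, bucket_seconds=60, level="ERROR")
```

No search index is involved: one pass over the stream, a filter, and a counter keyed by bucket start.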
Personally I find it more interesting Splunk doesn't get a look-in. It's a serious player in both SIEM and application/infrastructure monitoring. Not that it's very good, but it's one of the biggest names in the market.
Storage is 'spensive, cannot spend that on monitoring dammit!
Biggest issue I've run into *every* time we've tried to get things set up. Security *has* the syslogs from everything, but politically and security-wise they are unable to *share* them with the monitoring team. Duplication of storage (and it's a LOT of storage) is unacceptable. And proper monitoring has to be tuned, no matter how much you pay for it. The 'nice little bundle of options' you get from your vendor will *not* cover your scope, and you'll have to tune it for your environment.
And you have to beat your (umptyleven) development teams (agile devops or not) into making sure they have the right bits in their builds, and build to the structure that your monitoring is expecting, or at the very least follow a recipe that looks vaguely like a standard.
And to get it all working and tuned you need people that understand the monitored environment AND the way the tools work, and you have to pay for that. Those bodies are expensive.
God I can't wait to work in a smaller shop one day.
Oh. Right. Cheap Cloud.
Oh well. Off to the slave pen then.
<can you tell it was a budget week this week?>
Once a Graphite API compatible front end is layered on top of InfluxDB, yes.
Like what InfluxGraph provides - https://github.com/InfluxGraph/influxgraph
InfluxDB by itself can only ingest the Graphite protocol; it does not natively support the Graphite query API. A drop-in replacement for Graphite requires both.
It is a far better solution for a metrics data store, no doubt about that. Anyone that has ever tried to scale Graphite core, at all, will tell you that Graphite core does not scale at all well.
InfluxDB is many orders of magnitude better in terms of speed and resource usage.
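To make the ingest-versus-query split above a bit more concrete, here is a rough sketch of the two halves as a client sees them: the Graphite plaintext line format (`path value timestamp`) on the write side, and a `/render` URL on the read side. The host names, metric path and helper names are placeholders invented for the example, and port 2003 is only the conventional plaintext port, so check what your listener is actually configured for.

```python
import socket
import time
from urllib.parse import urlencode

def graphite_line(path, value, timestamp=None):
    """Format one metric in the Graphite plaintext protocol: 'path value timestamp\\n'."""
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"

def send_metric(host, port, path, value, timestamp=None):
    """Write side: push one metric over TCP to anything that speaks the plaintext protocol."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(graphite_line(path, value, timestamp).encode("ascii"))

def render_url(base_url, target, frm="-1h"):
    """Read side: build a Graphite render API URL that asks for JSON output."""
    query = urlencode({"target": target, "from": frm, "format": "json"})
    return f"{base_url}/render?{query}"

# Hypothetical usage; metrics.example and graphite.example are placeholders.
# send_metric("metrics.example", 2003, "servers.web01.load", 0.42)
line = graphite_line("servers.web01.load", 0.42, 1500000000)
url = render_url("http://graphite.example", "servers.web01.load")
```

A store that only accepts the first half can take your data but cannot answer the second kind of request, which is the gap a Graphite API compatible front end has to fill.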
Nobody's being paid to push them as "solutions" at trade shows, despite their demonstrated ability to solve many problems. They just keep doing what they do. Munin's graphs alone have helped me diagnose countless problems. Email notification of out-of-range values is a bonus.
Been doing monitoring stuff for almost 20 years now. Started with MRTG, then built my own custom graphing with rrdtool and rrdcgi, which handled probably 10 million data points a day back at the time (after I left the company they deployed Zabbix, and 3 years in they were still using my custom graphs), then moved to cacti (which was great/easy for end users but crap for anything else). Now my tool of choice is a SaaS platform called LogicMonitor (though it's not nearly as good as it was when I first started using it; I really HATE HATE HATE their new UI that was forced upon the customers, one of the many downsides to a SaaS platform, and it took 6 months to fix an annoying UI bug that didn't exist in the older UI). The org I am at has had graphite (and collectd, and more recently statsd/grafana) deployed for the past 5 years. Maybe I need 3 hands to count the number of times I have used graphite; it's just such a pain to get data out of. Maybe if you are a math whiz or something. Or maybe there is a (much) better web front end out there that I just haven't seen (I never set graphite up, someone else on my team did).
I don't doubt it is a good tool for some out there, but I can't bear to use it. Other people on my team use statsd/grafana; I haven't spent more than 30 minutes playing with that since it was deployed.
The core thing I really love about LogicMonitor that I haven't seen elsewhere is the ease of use with dynamic graphs/dashboards. Also integration with a ton of things I use, whether it is virtualization, firewalls, switches, load balancers (back in the day I must've spent 200 hours getting complete F5 BIG-IP stuff into the cacti servers I had at the time), power strips, servers, apps, etc. If there was a non-SaaS version of this kind of product I would jump on it, but so far I have not seen nor heard of anything that comes close in these areas. LogicMonitor does other things too, but I only use it for graphs and dashboards. I also ported the custom 3PAR monitoring that I developed for cacti, starting about 10 years ago, to LM (about 20k data points/minute coming from my arrays), and that works great too. With cacti I literally created over 1,000 graphs by hand for storage alone the last time I used it, because it wasn't built to do what I was trying to use it for. In the end it worked, but there was so much manual work involved in maintaining it.
The NHS and BBC are stuck in the stone age (having worked for both). Not surprised that central monitoring is based on out-of-date, unsupported open source software. To tell the truth, users are quicker at reporting issues than monitoring in both organisations.
The NHS is a nightmare, basically because there is no such thing anymore. It's a myriad of local trusts competing with each other to feed on the scraps, and each organisation has its own solution for everything that isn't entirely compatible with its neighbours'. The DfH can't even decide if there should be a standard, let alone what it should be (ongoing since the 90s).
The BBC is also a state. While I was there, a switch blade in one of the buildings (yes, still using chassis) got rebooted and we were chasing issues for 3 weeks. When they went to Windows 7, they did not test the main radio broadcast software for compatibility and guess what, it didn't work (cue dead air). The old divisions are fighting with each other and nobody knows what's going on.