Some terminology I use: measurement, which can be substituted with metric, and observation, which is a collection of measurements.
Collecting observations is done by agents. Agents can set up a receiving server (stream), poll data sources, or subscribe to a data source (stream). Observations may be processed/transformed here a bit to standardize/normalize measurements, since other agents may be collecting the same measurements from different sources.
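To make the normalization idea concrete, here is a minimal sketch (all names are hypothetical) of an agent mapping source-specific measurement names and units onto one canonical form before shipping them off:

```python
# Two sources report the same measurement under different names/units;
# the agent maps both onto one canonical measurement before transfer.

CANONICAL = {
    # source-specific name -> (canonical name, multiplier to canonical unit)
    "cpu_pct": ("cpu.util", 1.0),       # already a percentage
    "cpu_ratio": ("cpu.util", 100.0),   # 0..1 ratio -> percentage
}

def normalize(name: str, value: float) -> tuple[str, float]:
    canonical, factor = CANONICAL.get(name, (name, 1.0))
    return canonical, value * factor

print(normalize("cpu_ratio", 0.5))  # ('cpu.util', 50.0)
```

Doing this at the agent means every downstream consumer sees one measurement name regardless of which source produced it.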
Once you have the data, you have a good number of choices on what to do with it, so here are two: transfer it immediately to the data sinks, or store it in a cache from which a curator service pushes to the data sinks.
Before transferring, some additional data processing may be needed in the transmission step for formatting purposes.
If data is transferred immediately, then you have to worry about questions such as backfilling data if the sink goes down!
If data is in a cache, how much data do you want to store there before things get dropped? How do you make sure the same data isn't retransmitted for a time period after a successful transmission? Remember that not all data sources are as quick as others (aperiodicity).
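Both questions can be sketched together. Below is a minimal, hypothetical version of the cache-plus-curator idea: a bounded buffer that drops the oldest entries when full, with a per-source watermark so data the sink has already acknowledged is never retransmitted:

```python
from collections import deque

class MeasurementCache:
    def __init__(self, max_items: int):
        self.buf = deque(maxlen=max_items)  # oldest entries fall off when full
        self.watermark = {}                 # source -> last timestamp acked by sink

    def put(self, source: str, ts: int, value: float):
        self.buf.append((source, ts, value))

    def pending(self):
        # Only hand the curator data newer than what the sink acknowledged.
        return [(s, t, v) for (s, t, v) in self.buf
                if t > self.watermark.get(s, -1)]

    def ack(self, source: str, ts: int):
        # Sink confirmed everything up to ts for this source.
        self.watermark[source] = max(self.watermark.get(source, -1), ts)

cache = MeasurementCache(max_items=3)
cache.put("hostA", 1, 0.5)
cache.put("hostA", 2, 0.6)
cache.ack("hostA", 1)
print(cache.pending())  # [('hostA', 2, 0.6)]
```

Tracking the watermark per source is what handles aperiodicity: a slow source's watermark simply advances less often than a fast one's.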
Aggregation is a step done after the initial data collection, but is still under the data collection umbrella. The primary types of aggregation I researched and know of are batch and streaming. Batch aggregations happen when all the data is available; streaming aggregations happen as data comes through, or when enough data comes in to do the aggregation (delayed streaming? mini-batch?). I call this the hydration process: mini-blocks filter a stream for data and capture what is needed to perform the aggregation. A good example for hydration would be a division of measurements that arrive at different times (usually within 5 seconds of each other).
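The division example can be sketched as a tiny hydrator (names are hypothetical): a mini-block that watches the stream for the two measurements it needs and emits their ratio only once both have arrived:

```python
class RatioHydrator:
    """Emits numerator / denominator once both measurements show up."""

    def __init__(self, num_name: str, den_name: str):
        self.num_name, self.den_name = num_name, den_name
        self.values = {num_name: None, den_name: None}

    def feed(self, name: str, value: float):
        if name in self.values:
            self.values[name] = value
        if all(v is not None for v in self.values.values()):
            result = self.values[self.num_name] / self.values[self.den_name]
            self.values = {k: None for k in self.values}  # reset for next window
            return result
        return None  # still waiting for the other measurement

h = RatioHydrator("errors", "requests")
print(h.feed("requests", 200.0))  # None -- still waiting on "errors"
print(h.feed("errors", 4.0))      # 0.02
```

A real version would also need a timeout (the "within 5 seconds" part) so a missing measurement doesn't stall the block forever.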
Aggregations can feed other aggregations! A batch aggregator can feed a streaming aggregator, which can feed another streaming aggregator. This is a powerful feature.
Monitoring is an umbrella that contains measurement threshold checks and alerting. Systems monitor values and then immediately trigger alerts based on defined rules. These rules are typically simple order-relation checks. I think we can start to see that simple rules cannot scale beyond small deployments. At bigger companies, redundancies are in place, which means simple alerts are not as important anymore. How does one de-prioritize an alert where the rule has a 1-to-1 relation to the alert? Alerts should notify if and only if some higher-level measurement goes down. This higher-level measurement would incorporate the idea of the primary and redundant resources.
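Here is a minimal sketch (hypothetical names and thresholds) of such a higher-level measurement: alert on overall serving capacity across primary and redundant replicas, not on any single replica's health:

```python
def capacity(replicas_up: int, replicas_total: int) -> float:
    # The higher-level measurement: fraction of capacity still serving.
    return replicas_up / replicas_total

def should_alert(replicas_up: int, replicas_total: int,
                 min_capacity: float = 0.5) -> bool:
    # One dead replica out of three is fine; losing a majority is not.
    return capacity(replicas_up, replicas_total) < min_capacity

print(should_alert(2, 3))  # False -- redundancy absorbed the failure
print(should_alert(1, 3))  # True  -- the higher-level measurement degraded
```

Contrast this with a 1-to-1 rule like `replica_up == 0 -> alert`, which would page someone for a failure the redundancy was built to absorb.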
Again, typically, any alert notifies some people immediately. Due to the lack of heuristics/complex rules around alerts, incident management and response end up human-driven. Can we extend this rule -> alert relation to incorporate complexity? Of course.
My peeve here is that if we do not want to be notified on every alert, then not every alert is necessarily an incident. An incident, to me, is more involved and would contain a notification/escalation policy. It can occur because something alerted a bunch of times. Incidents define their own aggregation level. An incident could have a whole history log, such as escalations and remediations. One might say a root incident would absorb the child incidents in order to have one flat incident tracking everything.
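A minimal sketch of that idea (all names hypothetical): incidents as their own aggregation level, opened only after an alert fires enough times, with a history log attached to the incident itself:

```python
class Incident:
    def __init__(self, rule: str):
        self.rule = rule
        self.log = []  # history: escalations, remediations, etc.

    def record(self, entry: str):
        self.log.append(entry)

class IncidentManager:
    def __init__(self, threshold: int):
        self.threshold = threshold  # alert count needed to open an incident
        self.counts = {}
        self.incidents = {}

    def on_alert(self, rule: str):
        self.counts[rule] = self.counts.get(rule, 0) + 1
        if self.counts[rule] >= self.threshold and rule not in self.incidents:
            inc = Incident(rule)
            inc.record(f"opened after {self.counts[rule]} alerts")
            self.incidents[rule] = inc
        return self.incidents.get(rule)

mgr = IncidentManager(threshold=3)
mgr.on_alert("disk_full")
mgr.on_alert("disk_full")
inc = mgr.on_alert("disk_full")
print(inc.log)  # ['opened after 3 alerts']
```

The notification/escalation policy would hang off the `Incident`, not off the raw alerts, which is exactly the separation argued for above.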
In the end: monitor values, trigger some lower-level structure (alerts?), and define incidents with a notification policy, triggered using rules over those lower-level structures. Naming is hard... Rules, Alerts, Incidents...
Adding pub/sub on top of all of this, so other systems can use the data outside the scope of the architecture, is always a benefit. Examples are auditing, history and/or tracking SLAs.
Visualization requires a historical data source, so one can create panels of graphs to see the history of somewhat related values. As for third party open source software, Grafana does a good job generally.
Graph creation needs automation, and it can be automated if there is standardization of tags and template panels that need values filled in. These template panels should be exportable to other areas. A good example is system metrics, which on Linux servers are pretty much the same throughout (CPU util %, CPU load, etc.). There isn't much need to create these sets of graphs five hundred times a day across different teams and companies if metric paths/tags are standardized.
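The template-panel idea can be sketched in a few lines (the panel format here is made up, not any real dashboard schema): a panel definition with placeholder tags gets stamped out per host, so the same system-metrics panel never has to be hand-built twice:

```python
import string

# Hypothetical panel template with ${...} placeholders for tag values.
PANEL_TEMPLATE = {
    "title": "CPU util % - ${host}",
    "query": "avg(cpu.util{host='${host}'})",
}

def render_panel(template: dict, **tags) -> dict:
    # Fill every placeholder field in the template from the given tags.
    return {k: string.Template(v).substitute(tags) for k, v in template.items()}

panel = render_panel(PANEL_TEMPLATE, host="web01")
print(panel["title"])  # CPU util % - web01
```

This only works if the metric path (`cpu.util`) and the tag name (`host`) are standardized, which is the whole point of the paragraph above.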
The idea of dynamic graphs/dashboards is that certain measurement names/units would automatically generate graphs pertaining to them. If I want overall temperature, then I probably want to see a heatmap rather than a bunch of overlapping lines.
Some of these things I did, and some I did not have the resources or level of support to do. In an ideal world, I would have reached my goals of pushing things further in a historically underinvested area. This is something I'll likely do on my own time for my own projects.
Auditing measurement collection is something that needs investment in enterprises, to make sure data collection is actually occurring. Relying on a host or similar being up is not enough.
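A minimal sketch of auditing collection itself (names hypothetical): flag any source whose last measurement is older than its expected reporting interval, rather than just checking that the host is up:

```python
def stale_sources(last_seen: dict, intervals: dict, now: float) -> list:
    """Return sources that have missed their expected reporting window."""
    return sorted(
        src for src, ts in last_seen.items()
        if now - ts > intervals.get(src, 60)  # assume a 60s default interval
    )

last_seen = {"hostA.cpu": 100.0, "hostB.cpu": 40.0}   # last timestamp per source
intervals = {"hostA.cpu": 10, "hostB.cpu": 10}        # expected cadence (seconds)
print(stale_sources(last_seen, intervals, now=105.0))  # ['hostB.cpu']
```

Note that hostB could be perfectly reachable while its collection is broken; that gap is exactly what this kind of audit catches and a host-up check misses.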