Using Uhoh
Bring it to Life
Standing up Fault and Performance Management using Uhoh Servers and Uhoh Clients quickly delivers a comprehensive view of the health of your network. To take things to the next level, however, Uhoh needs to be hooked into other systems.
With a small amount of scripting, Uhoh can be used in conjunction with other platforms to deliver complete End-to-End Service Assurance solutions. A selection of sample Uhoh integration patterns is described below.
Utilising the Uhoh Client's capability for consuming log files unlocks the ability to chain several Uhoh systems together by configuring Uhoh Clients to consume the Log Streams of Uhoh Servers. This feature is useful in order to partition monitoring duties across several different domains within an organisation, bringing the output of the domains together to feed an overarching Manager of Managers Uhoh system. This pattern is illustrated in the diagram below:
Within this pattern:
Deploying separate Uhoh systems for different domains allows those domains to be empowered to manage and maintain monitoring coverage for their respective areas rather than relying on a central monitoring systems administration team - essential for effective DevOps delivery.
The Python script, log_consumer.py, supplied with Uhoh, is designed simply to consume an Uhoh Server's Log Stream. However, this script can easily be modified to collect metrics from the Log Stream and:
The derived data can then, in turn, be consumed by an Uhoh Client in order to apply threshold alerting.
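As a minimal sketch of the kind of modification described above (this is not the supplied log_consumer.py), the following assumes that Log Stream lines are plain text and that the value of interest can be picked out with a regular expression. The `latency_ms=` field, `LATENCY_RE`, `summarise` and `format_kpi` are all hypothetical names chosen for illustration.

```python
import re
from statistics import mean

# Hypothetical pattern: assumes lines carry a value such as "latency_ms=123".
LATENCY_RE = re.compile(r"latency_ms=(\d+)")

def summarise(lines):
    """Return count, min, max and mean of the values found in lines, or None."""
    values = [int(m.group(1)) for line in lines if (m := LATENCY_RE.search(line))]
    if not values:
        return None
    return {"count": len(values), "min": min(values),
            "max": max(values), "mean": mean(values)}

def format_kpi(name, summary):
    """Render a derived metric as a log line that a Client capture could parse."""
    return f"KPI: {name}: {summary['mean']:.0f}"
```

Writing the output of `format_kpi` to a file gives an Uhoh Client something to read with a capture directive, which is where threshold alerting would then be applied.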
By consuming the Uhoh Server's Log Stream, again using a modified version of log_consumer.py, Uhoh can be used to drive automated actions triggered by alerts - this is known as Closed-Loop Automation. Examples of such actions are:
Effective automation generally requires an accurate model (graph) describing the topology of services. Using Uhoh Server Log Stream output together with a processing script to feed node relationship and health information into a graph database such as Neo4j is an effective method of addressing this problem for complex environments. However, Uhoh's own service modelling configuration capabilities will suffice for most environments.
When running a fault-tolerant pair of Uhoh Servers, only one of the pair should initiate closed-loop automation activities. Therefore, the Log Stream handler script needs to suppress actions if the Uhoh Server Log Stream it is consuming recently contained an FT_SECONDARY event. Care should also be taken to throttle closed-loop actions to prevent overloading of downstream platforms.
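The two safeguards described above (staying passive after an FT_SECONDARY event, and throttling) could be sketched as a small gate object. This is an illustration only: it assumes an FT_SECONDARY event can be recognised as a substring of a Log Stream line, and the class name and both time windows are hypothetical values, not Uhoh defaults.

```python
import time

class ActionGate:
    """Decide whether a closed-loop action may run for a given Log Stream line."""

    def __init__(self, secondary_window=120.0, min_interval=30.0, clock=time.monotonic):
        self.secondary_window = secondary_window  # hypothetical: seconds to stay passive
        self.min_interval = min_interval          # hypothetical: seconds between actions
        self.clock = clock
        self.last_secondary = None
        self.last_action = None

    def allow(self, line):
        now = self.clock()
        if "FT_SECONDARY" in line:
            # This server is acting as the secondary: remember that and do nothing.
            self.last_secondary = now
            return False
        if self.last_secondary is not None and now - self.last_secondary < self.secondary_window:
            return False  # an FT_SECONDARY event was seen recently
        if self.last_action is not None and now - self.last_action < self.min_interval:
            return False  # throttled to protect downstream platforms
        self.last_action = now
        return True
```

A handler built this way would pass every FT_SECONDARY event and every line that would otherwise trigger an action through `allow`, and act only when it returns True.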
Semantic Monitoring is the periodic testing of sample flows or user-journeys to check on the health of a system. This sampling generally makes use of test data in such a way that the outcome of the checks clearly indicates whether the system under test is operating correctly or not. Semantic monitoring is particularly useful for testing an API.
The Uhoh Client's alert_cmd directive can be used to invoke a shell script which contains appropriate curl commands to run semantic monitoring checks against web services end-points. The Uhoh Client can be configured to consume the output from the script - collecting performance metrics or triggering fault alerts as necessary.
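The text above describes a shell script built around curl. As an alternative sketch of the same idea in Python, using only the standard library, the check below prints one line per test that a Client could match and capture. The check name, the URL passed in and the "Result:" line layout are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request

def http_status(url, timeout=5.0):
    """Fetch url and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code

def run_check(name, url, fetch=http_status, clock=time.perf_counter):
    """Run one check and return a single line for the Uhoh Client to parse."""
    start = clock()
    try:
        status = fetch(url)
    except OSError as err:  # URLError and socket timeouts are OSError subclasses
        return f"Result: {name} FAIL error={type(err).__name__}"
    elapsed_ms = (clock() - start) * 1000
    verdict = "OK" if status == 200 else "FAIL"
    return f"Result: {name} {verdict} status={status} ms={elapsed_ms:.0f}"
```

A script run via alert_cmd would call `print(run_check(...))` once per journey step. The `ms=` field could feed a performance metric, and a line containing FAIL could raise a fault alert.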
The power of scripting enhancements to an Uhoh system really comes into its own when integrating Uhoh with external systems. Examples of such integrations are:
As with Closed-Loop Automation, integrations such as these need to be careful to avoid taking duplicate actions (use the FT_SECONDARY functionality) and make use of throttling to prevent overloading of downstream platforms.
Example Uhoh Server Log Stream handler scripts are provided with the Uhoh distribution:
Uhoh has been primarily designed for collecting data through log files, and the majority of Uhoh's functionality for alert and metric management is geared towards the use of log files. However, it is occasionally useful to be able to feed alerts (alarms or metrics) into Uhoh via other means. The following methods of alert capture can be used as an alternative to log files:
Note that with UDP or Web-Hook integration, it is not possible to control the ingest of alerts by date/time using the active directive, or to count or parse messages as with alert_count. It may therefore be necessary to use an Uhoh Client to read the Uhoh Server log file in order to implement advanced parsing of alerts. You can, however, use alert_multi with alerts delivered via UDP or Web-Hook.
Although Uhoh is designed to be extremely easy to configure, getting the Uhoh Client configuration just right takes a little practice. This section describes a number of scenarios and how Uhoh Client configuration can be used to address them.
On Linux, use the uptime command to obtain host load average. The Uhoh Client is configured to run the uptime command periodically using alert_cmd. Finally, the alert_multi directive is used to raise a high-priority alert:
capture: load averages?: ([\d.]+)
alert_cmd: tags=METRIC_LOAD_#hostname# seconds=60 maximum=1 threshold_tags=CMD_UBE command=uptime
alert_multi: tags=RED seconds=60 collect=CMD_UBE message=High load detected
The alert_cmd runs uptime every sixty seconds and writes the load average value collected to a metric called LOAD_#hostname# (where #hostname# is the name of the host running the Uhoh Client). A threshold of 1 is also set - if this is breached, an alert tagged CMD_UBE is raised. The alert_multi directive is then used to raise a RED alert (i.e. an alert which appears in the Uhoh Fault Management View) if a CMD_UBE tagged alert has been seen within the last sixty seconds.
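The output of uptime differs between platforms: procps on Linux prints "load average: 0.08, 0.03, 0.05", whereas macOS and the BSDs print "load averages: 0.08 0.03 0.05". The sketch below checks a capture pattern against both forms offline before it goes into a Client configuration. The sample lines and the function name are illustrative, and it assumes that capture patterns behave like Python regular expressions.

```python
import re

# Tolerant pattern: optional plural "s", value captured without a trailing comma.
LOAD_RE = re.compile(r"load averages?: ([\d.]+)")

linux = " 10:15:01 up 5 days,  2:03,  1 user,  load average: 1.42, 0.97, 0.65"
macos = "10:15  up 5 days,  2:03, 1 user, load averages: 1.42 0.97 0.65"

def one_minute_load(uptime_output):
    """Return the one-minute load average as a float, or None if not found."""
    match = LOAD_RE.search(uptime_output)
    return float(match.group(1)) if match else None
```

A pattern that requires the plural form, such as `load averages: (\S+)`, finds nothing in the Linux sample, so it is worth testing the pattern against real output from the target host.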
Usage for applications varies over the course of the day. To set different minimum thresholds for 08:00 - 10:00, 10:00 - 18:00 and 18:00 - 22:00, based on logged transactions, use three separate alert_range directive instances, as follows:
file: transaction.log
match: Completed
active: 01234567;08:00-10:00
alert_range: tags=RED seconds=60 minimum=10 message=Abnormally low traffic
file: transaction.log
match: Completed
active: 01234567;10:00-18:00
alert_range: tags=RED seconds=60 minimum=50 message=Abnormally low traffic
file: transaction.log
match: Completed
active: 01234567;18:00-22:00
alert_range: tags=RED seconds=60 minimum=20 message=Abnormally low traffic
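The effect of the three blocks above can be modelled as a lookup from time of day to the minimum in force. This is only a sketch of the logic, not of how Uhoh evaluates the active directive: it assumes that the end of each window is exclusive and that "below minimum" means strictly less than. The function names are hypothetical.

```python
from datetime import time

# (start, end, minimum "Completed" lines per 60 seconds), mirroring the config above.
WINDOWS = [
    (time(8, 0), time(10, 0), 10),
    (time(10, 0), time(18, 0), 50),
    (time(18, 0), time(22, 0), 20),
]

def minimum_for(now):
    """Return the minimum in force at time-of-day now, or None outside all windows."""
    for start, end, minimum in WINDOWS:
        if start <= now < end:  # assumption: window end is exclusive
            return minimum
    return None

def is_abnormally_low(count, now):
    """True if count is below the minimum in force; never true outside the windows."""
    minimum = minimum_for(now)
    return minimum is not None and count < minimum
```

Outside 08:00 - 22:00 no window is active, so no low-traffic alert is raised regardless of the count.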
If a process is logging a value to a log file, and we need to use Uhoh to generate different alerts based on this value, we can use alert_average multiple times:
file: transaction.log
capture: Level: (\d+)
alert_average: tags=AMBER seconds=60 maximum=10 message=Warning: Level is high
file: transaction.log
capture: Level: (\d+)
alert_average: tags=RED seconds=60 maximum=20 message=Critical: Level is too high
An amber warning alert will be raised for the lower threshold breach of "Level" and a red critical alert for the upper threshold breach.
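To make the two-threshold behaviour concrete, here is a small model of a sixty-second average mapped onto the amber and red limits. It is a sketch under assumptions: how Uhoh actually computes the window average is not documented here, and this model reports only the most severe level, whereas two independent directives would presumably both fire once the value exceeds 20. The class and function names are hypothetical.

```python
from collections import deque

class WindowAverage:
    """Average of the values seen during the last `seconds` seconds."""

    def __init__(self, seconds=60):
        self.seconds = seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= timestamp - self.seconds:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)

def severity(average, amber_max=10, red_max=20):
    """Map an average onto the two maximum thresholds used above."""
    if average is None or average <= amber_max:
        return None
    return "RED" if average > red_max else "AMBER"
```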
Uhoh is primarily focused on making use of log file analysis to determine the health of your system. However, using alert_cmd it is quite simple to set up SNMP Polling to collect performance metrics. For example:
capture: HOST-RESOURCES-MIB::hrSystemProcesses.0 = Gauge32: (\d+)
alert_cmd: tags=GREEN,METRIC_PROCS_#hostname# seconds=60 command=snmpget -v 2c -c public localhost HOST-RESOURCES-MIB::hrSystemProcesses.0
The Uhoh Client configuration snippet above runs the snmpget command every sixty seconds and extracts a specific metric from the result, using the capture configuration directive.
This example runs the snmpget command against the local host that the Uhoh Client is running on. However, for SNMP polling against elements or hosts which are unable to run the Uhoh Client, a similar configuration could be used, but running on another Uhoh Client - for example a Client which polls many SNMP sources. For this latter scenario, tags would be needed to distinguish between metrics collected from different sources.
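When one Client polls many SNMP sources, each metric needs a tag that identifies its source. The sketch below parses the snmpget output line shown above and derives a per-host tag. The METRIC_PROCS_<host> naming scheme and the function names are assumptions for illustration, not an Uhoh convention.

```python
import re

GAUGE_RE = re.compile(r"HOST-RESOURCES-MIB::hrSystemProcesses\.0 = Gauge32: (\d+)")

def process_count(snmpget_output):
    """Return the process count from an snmpget result line, or None."""
    match = GAUGE_RE.search(snmpget_output)
    return int(match.group(1)) if match else None

def metric_tag(host):
    """Hypothetical naming scheme: one METRIC_PROCS_<host> tag per polled source."""
    safe = re.sub(r"[^A-Za-z0-9]+", "_", host).strip("_")
    return f"METRIC_PROCS_{safe}"
```

Generating one capture and alert_cmd pair per host from a list, with `metric_tag(host)` as the tag, keeps the metrics from different sources apart.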
When extracting multiple items of information from a log file, for example:
... you would use a separate alert_all, alert_count, alert_average etc. directive for each item. However, when using alert_cmd, you don't want to be running the command multiple times - once for each extract. Therefore, an alternative approach is required.
A technique available to solve this problem is to make use of the Uhoh Server's Log Stream (server.log) file as follows:
Firstly, use an alert_cmd to run the command and match all lines which need to be returned. In this example, that is any line containing the string Result:
match: Result:
alert_cmd: tags=COMMAND_OUTPUT seconds=60 command=run_all_checks.sh
This will write all lines matched to the Log Stream server.log file, tagged as COMMAND_OUTPUT.
Then, configure separate log captures to read and parse the Log Stream as required (these configuration items need to be set for an Uhoh Client which is running on the same host as the Uhoh Server):
file: server.log
match: COMMAND_OUTPUT.+Result: Process count = (\d+)
alert_average: tags=METRIC_PROCS seconds=60
file: server.log
match: COMMAND_OUTPUT.+Result: Exception:
alert_count: tags=METRIC_EXCEPTIONS seconds=60
file: server.log
match: COMMAND_OUTPUT.+Result: Exception: Null
alert_all: tags=RED message=Null value detected
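The three patterns above can be dry-run against sample lines to confirm that each one picks out what is intended. The sketch assumes the patterns behave like Python regular expressions. The sample line layout in the usage note is hypothetical, and the real server.log format may differ.

```python
import re

PROCS_RE = re.compile(r"COMMAND_OUTPUT.+Result: Process count = (\d+)")
EXCEPTION_RE = re.compile(r"COMMAND_OUTPUT.+Result: Exception:")
NULL_RE = re.compile(r"COMMAND_OUTPUT.+Result: Exception: Null")

def fan_out(lines):
    """Apply all three patterns to each line, as three separate Client blocks would."""
    procs, exceptions, nulls = [], 0, 0
    for line in lines:
        if (m := PROCS_RE.search(line)):
            procs.append(int(m.group(1)))   # feeds the averaged metric
        if EXCEPTION_RE.search(line):
            exceptions += 1                 # feeds the counted metric
        if NULL_RE.search(line):
            nulls += 1                      # would raise one alert per line
    average = sum(procs) / len(procs) if procs else None
    return {"procs_average": average, "exception_count": exceptions, "null_alerts": nulls}
```

For example, a line such as `COMMAND_OUTPUT host1 Result: Exception: Null` is seen by both the exception counter and the Null alert, because the second pattern is a prefix of the third.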
Let's say you have a metric which is periodically written to a log file. Using Uhoh, you may often want to:
Here's how to do it.
The log file could look something like this:
KPI: ORDER_VOLUME: 48
We would use alert_average to capture the metric (averaged over a specific time-period) and apply the threshold check - producing an amber alert in the Uhoh Fault Management View. The SVC_ORDER_VOL tag is used to drive the Service Impact Analysis display:
file: logfile.log
capture: KPI: ORDER_VOLUME: (\d+)
alert_average: tags=AMBER,SVC_ORDER_VOL seconds=60 maximum=30 message=Order volume is abnormally high
We then use alert_all to write the metric to a Performance Management chart:
file: logfile.log
capture: KPI: ORDER_VOLUME: (\d+)
alert_all: tags=METRIC_ORDER_VOL
Finally, we would use alert_multi to display the service impact in a Service View Map as well as an additional red alert in the Fault Management View. The SVC_ORDER_VOL tag created earlier is used to trigger this alert:
alert_multi: tags=RED,MAP_ORDERS,MAP_SIGN_UP collect=SVC_ORDER_VOL seconds=120 message=Orders >> Sign-Up
(An Uhoh Service Map View will have been previously created containing items tagged as MAP_ORDERS and MAP_SIGN_UP.)