Using Uhoh

Bring it to Life

Uhoh in the Real World

Standing up Fault and Performance Management using Uhoh Servers and Uhoh Clients very quickly delivers a comprehensive view of the health of your network. However, to take things to the next level, Uhoh needs to be hooked into other systems.

With a small amount of scripting, Uhoh can be used in conjunction with other platforms to deliver complete End-to-End Service Assurance solutions. A selection of sample Uhoh integration patterns is described below.

Operating Multiple Uhoh Domains

Utilising the Uhoh Client's capability for consuming log files unlocks the ability to chain several Uhoh systems together by configuring Uhoh Clients to consume the Log Streams of Uhoh Servers. This feature is useful in order to partition monitoring duties across several different domains within an organisation, bringing the output of the domains together to feed an overarching Manager of Managers Uhoh system. This pattern is illustrated in the diagram below:

Within this pattern:

Each domain uses a separate set of UDP ports for Uhoh Server to Uhoh Client communication.
The domains labelled Production and Testing manage all low-level monitoring duties for their respective areas. Fault and Performance Management Views are available for these two domains.
The domain labelled Operations Centre consumes specific notification alerts from the Production and Testing domains. This domain is configured to focus on higher-level alert types and metrics relevant for End-to-End Service Assurance.

Deploying separate Uhoh systems for different domains allows those domains to be empowered to manage and maintain monitoring coverage for their respective areas rather than relying on a central monitoring systems administration team - essential for effective DevOps delivery.

Advanced Service Quality Management

The Python script log_consumer.py, supplied with Uhoh, has been designed simply to consume an Uhoh Server's Log Stream. However, this script can easily be modified to collect metrics from the Log Stream and:

Perform more advanced mathematical functions on metrics such as:
- Standard deviation.
- Percentile groupings.
Analyse metrics using external reference data, for example:
- Writing-to and reading-from a BigData store.
- Implementation of machine learning algorithms for automatic baselining.
Calculate continuous service quality measures such as:
- By-service health indices derived from multiple key performance indicators.
Calculate periodic service level agreement measures such as:
- Weekly, monthly or quarterly reporting.
- Progress reporting indicating the likelihood of meeting or missing an SLA.
- Incorporating contractual exceptions such as planned maintenance events.
- The Perl script metric_aggregator.pl provides an example of how metrics collected by Uhoh can be used to create a chart displaying a daily average over a period spanning a number of days.
Discover and highlight more complex monitoring scenarios such as loss of resilience:
- Useful in highly dynamic environments such as Public Cloud.
- Where logic is required to ensure that the minimum running instances count is assured.
- Or where specific geographical redundancy policies need to be maintained.

The derived data can then, in turn, be consumed by an Uhoh Client in order to apply threshold alerting.
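As a rough illustration of the first item in the list above, the sketch below computes a mean, a standard deviation and a percentile for a window of metric values using only the Python standard library. How the values are extracted from the Log Stream is not shown here; the function name summarise and the shape of its input are assumptions made for illustration and are not part of log_consumer.py.

```python
import statistics


def summarise(values, percentile=95):
    """Return mean, sample standard deviation and one percentile (1-99) of a window."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    # quantiles(n=100) returns the 99 cut points for percentiles 1 to 99
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "percentile": cut_points[percentile - 1],
    }
```

The resulting figures could be written to a log file, one line per window, so that an Uhoh Client can consume them and apply threshold alerting.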

Implementing Closed-Loop Automation

By consuming the Uhoh Server's Log Stream, again using a modified version of log_consumer.py, Uhoh can be used to drive automated actions triggered by alerts - this is known as Closed-Loop Automation. Examples of such actions are:

Attaching additional storage if calculated storage consumption rates indicate that a limit may be close to being reached.
Re-starting a component if a certain sequence of faults is detected.
Pausing processing of queued jobs to reduce throughput for a component which has been detected as overloaded.

Effective automation generally requires an accurate model (graph) describing the topology of services. Using Uhoh Server Log Stream output together with a processing script to feed node relationship and health information into a graph database such as Neo4j is an effective method of addressing this problem for complex environments. However, Uhoh's own service modelling configuration capabilities will suffice for most environments.

When running a fault-tolerant pair of Uhoh Servers, you will require only one of the pair to initiate closed-loop automation activities. Therefore, the Log Stream handler script will need to suppress taking action if the Uhoh Server Log Stream it is consuming recently contained an FT_SECONDARY event. Care should also be taken to throttle closed-loop actions to prevent overloading of downstream platforms.
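The two safeguards described above, suppression after an FT_SECONDARY event and throttling, could be combined in a small helper along the following lines. This is a minimal sketch and not the logic of any supplied script: the class name ActionGate, the window lengths, and the assumption that the text FT_SECONDARY appears literally in a Log Stream line are all illustrative.

```python
import time


class ActionGate:
    """Decide whether a closed-loop action may be taken right now."""

    def __init__(self, secondary_window=300, min_interval=60, clock=time.monotonic):
        self.secondary_window = secondary_window  # seconds to stay quiet after FT_SECONDARY
        self.min_interval = min_interval          # minimum seconds between two actions
        self.clock = clock
        self.last_secondary = None
        self.last_action = None

    def observe(self, line):
        """Remember when an FT_SECONDARY event was last seen in the Log Stream."""
        if "FT_SECONDARY" in line:
            self.last_secondary = self.clock()

    def allow(self):
        """Return True if an action may run now, recording the time if so."""
        now = self.clock()
        if self.last_secondary is not None and now - self.last_secondary < self.secondary_window:
            return False  # this server was recently the secondary of the pair
        if self.last_action is not None and now - self.last_action < self.min_interval:
            return False  # throttled to protect downstream platforms
        self.last_action = now
        return True
```

A handler script would call observe() for every line it reads and check allow() immediately before triggering an action.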

Implementing Semantic Monitoring

Semantic Monitoring is the periodic testing of sample flows or user-journeys to check on the health of a system. This sampling generally makes use of test data in such a way that the outcome of the checks will clearly indicate whether the system under test is operating correctly or not. An example of where semantic monitoring would be most useful would be for the testing of an API.

The Uhoh Client's alert_cmd directive can be used to invoke a shell script which contains appropriate curl commands to run semantic monitoring checks against web services end-points. The Uhoh Client can be configured to consume the output from the script - collecting performance metrics or triggering fault alerts as necessary.
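The same idea can be sketched in Python rather than shell and curl. The example below runs one check and returns a single line of output that a match or capture pattern could pick up. The function names, the output line layout and the latency_ms field are assumptions for illustration; any endpoint and test data would need to come from the system under test.

```python
import time
import urllib.request


def http_fetch(url, timeout=5):
    """Fetch a URL and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8", errors="replace")


def run_check(name, url, expected_text, fetch=http_fetch):
    """Run one semantic check and return a one-line result for the Uhoh Client."""
    started = time.monotonic()
    try:
        status, body = fetch(url)
        ok = status == 200 and expected_text in body
    except OSError:
        ok = False  # connection failures and HTTP error responses count as a failed check
    elapsed_ms = int((time.monotonic() - started) * 1000)
    outcome = "PASS" if ok else "FAIL"
    return f"Result: {name} {outcome} latency_ms={elapsed_ms}"
```

A script invoked by alert_cmd would print the line returned by run_check for each test journey. The fetch parameter exists only so the check can be exercised without a network connection.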

Integrating with External Services

The power of scripting enhancements to an Uhoh system really comes into its own when integrating Uhoh with external systems. Examples of such integrations are:

Using an Uhoh Server Log Stream consumer to periodically load metrics into Amazon Web Services CloudWatch. CloudWatch can then manage team notifications, threshold checks, data retention policies, dashboards and so on.
Using an Uhoh Server Log Stream consumer to feed a Slack Channel with fault alert notifications.
Using the alert_cmd Uhoh Client configuration directive to periodically run snmpget commands - the output of which is consumed by the Uhoh Client in order to collect performance metrics and fault alerts.

As with Closed-Loop Automation, integrations such as these need to be careful to avoid taking duplicate actions (use the FT_SECONDARY functionality) and make use of throttling to prevent overloading of downstream platforms.

Example Uhoh Server Log Stream handler scripts are provided with the Uhoh distribution:

Python script log_consumer.py simply reads the Uhoh Server Log Stream and writes to STDOUT.
Python script elastic_fantastic.py loads alert data from the Uhoh Server Log Stream into an ElasticSearch cluster.
Python script uhoh_slack.py loads alert data from the Uhoh Server Log Stream into a Slack channel.
Python script uhoh_teams.py loads alert data from the Uhoh Server Log Stream into a Microsoft Teams channel.
Perl script log_consumer.pl transfers the Uhoh Server Log Stream to STDOUT in the same way as log_consumer.py.

Instrumenting Your Applications

Uhoh has been primarily designed for collecting data through log files, and the majority of Uhoh's functionality for alert and metric management is geared towards the use of log files. However, it is occasionally useful to be able to feed alerts (alarms or metrics) into Uhoh via other means. The following methods of alert capture can be used as an alternative to log files:

Sending alerts to the Uhoh Client as UDP messages:
- Decouples the application being instrumented from Uhoh Client.
- Useful for environments where log files are difficult to manage (eg. Windows).
- Easy to do from a shell-script using the nc command.
- UDP messages need to be formatted to use the INJECT directive (see details in Set up a Client).
Sending alerts to the Uhoh Client as REST Web-Hooks:
- Could slow down an application if the Uhoh Client is slow to process a request. (NB: The Uhoh Client REST web server is single-threaded.)
- Easy to do from a shell-script using the curl command.
- See Set up a Client for further details.

Note that with UDP or Web-Hook integration, it is not possible to control ingest of alerts by date/time using the active directive, or to count or parse messages as with alert_count. It may therefore be necessary to use an Uhoh Client to read the Uhoh Server log file in order to implement advanced parsing of alerts. You can, however, use alert_multi with alerts delivered via UDP or Web-Hook.
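For applications written in Python, the UDP option can be sketched with the standard socket module instead of nc. The payload format is deliberately not shown: it has to follow the INJECT directive as described in Set up a Client. The function name send_udp and the default host and port are placeholders, not Uhoh defaults.

```python
import socket


def send_udp(payload, host="127.0.0.1", port=9999):
    """Send one text payload as a single UDP datagram; returns the bytes sent."""
    data = payload.encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # UDP is fire-and-forget: no reply from the Uhoh Client is awaited
        return sock.sendto(data, (host, port))
```

Because no acknowledgement is awaited, the application stays decoupled from the Uhoh Client, at the cost of not knowing whether a message arrived.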

Uhoh Client Configuration: Sample Recipes

Although Uhoh is designed to be extremely easy to configure, getting the Uhoh Client configuration just right takes a little practice. This section describes a number of scenarios and how Uhoh Client configuration can be used to address them.

Host Load Average: Collecting Metrics and Configuring Threshold Alerting

On Linux, use the uptime command to obtain host load average. The Uhoh Client is configured to run the uptime command periodically using alert_cmd. Finally, the alert_multi directive is used to raise a high-priority alert:

capture:           load averages?: ([\d.]+)

alert_cmd:         tags=METRIC_LOAD_#hostname# seconds=60 maximum=1 threshold_tags=CMD_UBE command=uptime

alert_multi:       tags=RED seconds=60 collect=CMD_UBE message=High load detected

The alert_cmd runs uptime every sixty seconds and writes the load average value collected to a metric called LOAD_#hostname# (where #hostname# is the name of the host running the Uhoh Client). A threshold of 1 is also set - if this is breached, an alert tagged CMD_UBE is raised. The alert_multi directive is then used to raise a RED alert (ie. an alert which appears in the Uhoh Fault Management View) if a CMD_UBE tagged alert is seen within the last sixty seconds.
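The wording of uptime output varies between platforms, so it can help to try a capture pattern against sample output before deploying it. The sketch below does this in Python with a tolerant variant of the pattern. The sample lines are illustrative, and it is an assumption that Uhoh's regular-expression handling behaves like Python's re module.

```python
import re

# Tolerant variant: optional plural "s", and digits/dots only so that a
# trailing comma is not captured along with the number.
LOAD_PATTERN = re.compile(r"load averages?: ([\d.]+)")

# Illustrative samples: Linux prints "load average:" with comma separators,
# macOS prints "load averages:" with space separators.
LINUX_SAMPLE = " 10:23:45 up 5 days,  2:03,  3 users,  load average: 0.08, 0.03, 0.05"
MACOS_SAMPLE = "10:23  up 5 days,  2:03, 3 users, load averages: 1.52 1.60 1.71"


def one_minute_load(uptime_output):
    """Return the first (one-minute) load average, or None if nothing matches."""
    found = LOAD_PATTERN.search(uptime_output)
    return float(found.group(1)) if found else None
```

Both sample lines yield the one-minute figure, which is the value that would be compared against the maximum=1 threshold.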

Threshold for Minimum Traffic Levels at Different Times of the Day

Usage for applications varies over the course of the day. To set different minimum thresholds for 08:00 - 10:00, 10:00 - 18:00 and 18:00 - 22:00, based on logged transactions, use three separate alert_range directive instances, as follows:

file:          transaction.log

match:         Completed

active:        01234567;08:00-10:00

alert_range:   tags=RED seconds=60 minimum=10 message=Abnormally low traffic

file:          transaction.log

match:         Completed

active:        01234567;10:00-18:00

alert_range:   tags=RED seconds=60 minimum=50 message=Abnormally low traffic

file:          transaction.log

match:         Completed

active:        01234567;18:00-22:00

alert_range:   tags=RED seconds=60 minimum=20 message=Abnormally low traffic

Variable Alert Based on a Value in a Log File

If a process is logging a value to a log file, and we need to use Uhoh to generate different alerts based on this value, we can use alert_average multiple times:

file:          transaction.log

capture:       Level: (\d+)

alert_average: tags=AMBER seconds=60 maximum=10 message=Warning: Level is high

file:          transaction.log

capture:       Level: (\d+)

alert_average: tags=RED seconds=60 maximum=20 message=Critical: Level is too high

An amber warning alert will be raised for the lower threshold breach of "Level" and a red critical alert for the upper threshold breach.

Polling an SNMP Source

Uhoh is primarily focused on making use of log file analysis to determine the health of your system. However, using alert_cmd it is quite simple to set up SNMP Polling to collect performance metrics. For example:

capture:    HOST-RESOURCES-MIB::hrSystemProcesses.0 = Gauge32: (\d+)

alert_cmd:  tags=GREEN,METRIC_PROCS_#hostname# seconds=60 command=snmpget -v 2c -c public localhost HOST-RESOURCES-MIB::hrSystemProcesses.0

The Uhoh Client configuration snippet above runs the snmpget command every sixty seconds and extracts a specific metric from the result, using the capture configuration directive.

This example runs the snmpget command against the local host that the Uhoh Client is running on. However, for SNMP polling against elements or hosts which are unable to run the Uhoh Client, a similar configuration could be used, but running on another Uhoh Client - for example a Client which polls many SNMP sources. For this latter scenario, tags would be needed to distinguish between metrics collected from different sources.
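For the multi-source scenario, the per-source configuration could be generated rather than written by hand. The sketch below builds one capture and alert_cmd pair per target, with the host name embedded in the metric tag. It assumes that such pairs may be repeated within one Uhoh Client configuration, as the file-based recipes repeat their blocks, and that host names are acceptable within tags; the function name and the host names used with it are hypothetical.

```python
def snmp_poll_config(hosts, community="public", seconds=60):
    """Build one capture/alert_cmd pair per host, each with its own metric tag."""
    oid = "HOST-RESOURCES-MIB::hrSystemProcesses.0"
    blocks = []
    for host in hosts:
        blocks.append(
            f"capture:    {oid} = Gauge32: (\\d+)\n"
            f"\n"
            f"alert_cmd:  tags=GREEN,METRIC_PROCS_{host} seconds={seconds} "
            f"command=snmpget -v 2c -c {community} {host} {oid}"
        )
    return "\n\n".join(blocks)
```

Calling print(snmp_poll_config(["router1", "switch2"])) would produce two blocks whose metrics are tagged METRIC_PROCS_router1 and METRIC_PROCS_switch2.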

Extracting Multiple Items of Information From Running a Command

When extracting multiple items of information from a log file, for example:

Extracting a value.
Counting exceptions.
Extracting a specific string.

... you would use a separate alert_all, alert_count, alert_average etc. directive for each item. However, when using alert_cmd, you don't want to be running the command multiple times - once for each extract. Therefore, an alternative approach is required.

A technique available to solve this problem is to make use of the Uhoh Server's Log Stream (server.log) file as follows:

Firstly, use an alert_cmd to run the command, matching every line which needs to be returned. In this example, that is any line containing the string Result:

match:           Result:

alert_cmd:       tags=COMMAND_OUTPUT seconds=60 command=run_all_checks.sh

This will write all lines matched to the Log Stream server.log file, tagged as COMMAND_OUTPUT.

Then, configure separate log captures to read and parse the Log Stream as required (these configuration items need to be set for an Uhoh Client which is running on the same host as the Uhoh Server):

file:            server.log

capture:         COMMAND_OUTPUT.+Result: Process count = (\d+)

alert_average:   tags=METRIC_PROCS seconds=60

file:            server.log

match:           COMMAND_OUTPUT.+Result: Exception:

alert_count:     tags=METRIC_EXCEPTIONS seconds=60

file:            server.log

match:           COMMAND_OUTPUT.+Result: Exception: Null

alert_all:       tags=RED message=Null value detected
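To see how the three patterns above divide up a single run of the command, the sketch below applies them to some made-up output. The command output lines and the Log Stream layout (the tag followed by the original line) are assumptions for illustration; the real layout of server.log may differ, although the patterns only require that the tag precedes the Result: text.

```python
import re

# Hypothetical output from run_all_checks.sh, for illustration only.
COMMAND_LINES = [
    "Starting checks",
    "Result: Process count = 214",
    "Result: Exception: Timeout",
    "Result: Exception: Null",
]

PROCS = re.compile(r"COMMAND_OUTPUT.+Result: Process count = (\d+)")
EXCEPTION = re.compile(r"COMMAND_OUTPUT.+Result: Exception:")
NULL_EXCEPTION = re.compile(r"COMMAND_OUTPUT.+Result: Exception: Null")


def to_log_stream(lines, tag="COMMAND_OUTPUT"):
    """Keep lines containing 'Result:' and prefix the tag (assumed line layout)."""
    return [f"{tag} {line}" for line in lines if "Result:" in line]


def extract(stream):
    """Apply the three patterns to every Log Stream line."""
    process_counts = []
    exceptions = 0
    null_exceptions = 0
    for line in stream:
        found = PROCS.search(line)
        if found:
            process_counts.append(int(found.group(1)))
        if EXCEPTION.search(line):
            exceptions += 1
        if NULL_EXCEPTION.search(line):
            null_exceptions += 1
    return process_counts, exceptions, null_exceptions
```

Note that the Exception: Null line satisfies both the general exception pattern and the Null pattern, so it contributes to the exception count as well as to the Null check.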

Capturing a Logged Metric and Applying a Threshold Check to Drive the Service Map View

Let's say you have a metric which is periodically written to a log file. Using Uhoh, you may often want to:

Capture the metric for viewing on an Uhoh Performance Management View chart.
If the metric exceeds a specific value:
- Trigger an alert to be displayed in the Uhoh Fault Management View.
- Display a Service Impact Analysis on an Uhoh Service Map View.

Here's how to do it.

The log file could look something like this:

KPI: ORDER_VOLUME: 48

We would use alert_average to capture the metric (averaged over a specific time-period) and apply the threshold check - producing an amber alert in the Uhoh Fault Management View. The SVC_ORDER_VOL tag is used to drive the Service Impact Analysis display:

file:           logfile.log

capture:        KPI: ORDER_VOLUME: (\d+)

alert_average:  tags=AMBER,SVC_ORDER_VOL seconds=60 maximum=30 message=Order volume is abnormally high

We then use alert_all to write the metric to a Performance Management chart:

file:           logfile.log

capture:        KPI: ORDER_VOLUME: (\d+)

alert_all:      tags=METRIC_ORDER_VOL

And finally, we would use alert_multi to display the service impact in a Service Map View as well as an additional red alert in the Fault Management View. The SVC_ORDER_VOL tag created earlier is used to trigger this alert:

alert_multi:    tags=RED,MAP_ORDERS,MAP_SIGN_UP collect=SVC_ORDER_VOL seconds=120 message=Orders >> Sign-Up

(An Uhoh Service Map View will have been previously created containing items tagged as MAP_ORDERS and MAP_SIGN_UP.)
