The Client

How to set up an Uhoh client

Starting an Uhoh Client

To run an Uhoh Client on a host, only the uhoh.jar file from the Uhoh distribution is required to be present on that host. Copy this file to a folder on the host to be monitored and run the following command to start the Uhoh Client:

java -cp uhoh.jar com.uhoh.Client 8888

Port 8888 is the default Uhoh Server to Uhoh Client UDP broadcast port. If you have configured your Uhoh Server to use an alternative port, specify that port instead.

The Uhoh Client will listen for UDP broadcast advert messages from the Uhoh Server and on receiving such a message will request a configuration. Uhoh Clients, by default, request a configuration with the name of the host that the Uhoh Client is running on.

If you are running Uhoh in an environment that doesn't support UDP broadcast (for example Amazon Web Services), then you will need to specify the IP address or addresses of Uhoh Servers when starting an Uhoh Client using the servers parameter:

java -cp uhoh.jar com.uhoh.Client 8888 servers=10.1.1.1,10.1.1.2

The Uhoh Servers are specified as a comma-separated list of IP addresses or DNS names. The Uhoh Client will poll each Uhoh Server in the list until one of them responds with a configuration file.

It is also possible for the Uhoh Client to specify exactly which configuration file it should use from the Uhoh Server's 'clientconfigs' library rather than requesting a configuration file with the same name as the host that the Uhoh Client is running on. To do this, use the optional type parameter when starting the Uhoh Client:

java -cp uhoh.jar com.uhoh.Client 8888 type=webserver

The 'type' parameter would be most commonly used in highly dynamic environments where virtual hosts are started frequently and the Uhoh Clients running on these hosts can inform the Uhoh Server of the type of host that has been started, rather than relying on hostname (which may be dynamically assigned).

The Uhoh Client Configuration File in Detail

Each host to be monitored by Uhoh requires an Uhoh Client Configuration File. This file is placed within the clientconfigs folder on the Uhoh Server host. The file's name is the name of the host to be monitored. When an Uhoh Client is started on a host, the Uhoh Client will request a monitoring configuration file named with that host's name from the Uhoh Server.

The Uhoh Client Configuration File for each host contains sets of directives. These directives specify the types of monitoring operations that the Uhoh Client should perform. All directive types are described below.

Alerts sent from Uhoh Clients to Uhoh Servers are labelled with a comma-separated list of tags. These tags are to be used by programs which consume the server's Log Stream and are used to differentiate between alert types. There are also three special tags - RED, AMBER and GREEN. Alerts tagged with these values will appear in the Fault Management View provided by the Uhoh Server's built-in web server as well as the Log Stream.

If the alert message or alert tags contain the string #hostname#, then this string will be replaced by the name of the host that the Uhoh Client is running on. This feature is useful for tagging alerts as originating from a specific host.
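The substitution described above can be pictured with a short Python sketch. This is only an illustration of the behaviour, not the Uhoh Client's own code; the function name expand_hostname is hypothetical.

```python
import socket

def expand_hostname(text, hostname=None):
    """Replace every #hostname# placeholder with a host name.

    When no host name is supplied, the local host name is used, which is
    what the documentation describes the Uhoh Client as doing.
    """
    if hostname is None:
        hostname = socket.gethostname()
    return text.replace("#hostname#", hostname)

if __name__ == "__main__":
    # "web01" is a made-up host name used only for this example.
    print(expand_hostname("METRIC_A_#hostname#", hostname="web01"))
```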

Extracting Information From Log Files

The following directives are used to initiate collection of textual information from log files:

alert_all - alert if a line in a log file matches a regular expression.
alert_count - count the number of times a line matching a regular expression appears in a log file within a specific time period.
alert_range - alert if the number of regular expression matches in a log file within a specific time period exceeds or falls below set limits.
alert_total - pick numerical values from a log file and add them together over a specific period of time.
alert_minimum - pick numerical values from a log file and output the minimum value over a specific period of time.
alert_maximum - pick numerical values from a log file and output the maximum value over a specific period of time.
alert_average - pick numerical values from a log file and output the average of all values over a specific period of time.

The alert_all directive configures the Uhoh Client to consume a log file and triggers an alert each time a line containing a particular string match is found. This directive is used as follows:

file: <log_file_name>

match: <regular_expression>

alert_all: tags=<tags>

You can also add an optional active directive which describes when this check will be active. The parameter supplied to the active directive is a comma-separated list of <days>;<hour/min-hour/min> items.

<days> are days of the week - numbered 1-7 where 1 = Sunday, 2 = Monday etc.
<hour/min-hour/min> is written as start hour/min - end hour/min in the format HH:MM-HH:MM and defines a time range for when the directive will be active.

The active directive can be used with any alert directive type.
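To make the day numbering and time ranges concrete, here is a minimal Python sketch that decides whether a given moment falls inside an active value. It is an illustration only: the function names are hypothetical, and treating the end time as inclusive is an assumption that the documentation does not confirm.

```python
from datetime import datetime, time

def parse_active(spec):
    """Parse '<days>;HH:MM-HH:MM,...' into a list of (days, start, end)."""
    windows = []
    for item in spec.split(","):
        days, hours = item.strip().split(";")
        start, end = hours.split("-")
        windows.append((
            {int(d) for d in days},      # e.g. "17" -> {1, 7}
            time.fromisoformat(start),
            time.fromisoformat(end),
        ))
    return windows

def is_active(spec, when):
    """Return True if `when` falls inside any window of the active value."""
    # Uhoh numbers days 1-7 with 1 = Sunday; isoweekday() has Monday = 1.
    uhoh_day = when.isoweekday() % 7 + 1
    now = when.time().replace(second=0, microsecond=0)
    # Assumption: both ends of the time range are inclusive.
    return any(
        uhoh_day in days and start <= now <= end
        for days, start, end in parse_active(spec)
    )

if __name__ == "__main__":
    spec = "23456;09:00-17:30,17;10:00-16:00"
    print(is_active(spec, datetime(2016, 1, 4, 9, 30)))   # a Monday morning
```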

Below is an example of alert_all where lines containing "Type: <number>" are found. The consumer is only active from 09:00 - 17:30 Monday - Friday and 10:00 - 16:00 Saturday - Sunday.

file: logfile.log

active: 23456;09:00-17:30,17;10:00-16:00

match: Type: \d+

alert_all: tags=RED

The alert_all line can also trigger a customised message, as follows:

alert_all: tags=RED message=Type identified

In this case, the alert recorded (and sent to the Fault Management View if a RED, AMBER or GREEN tag is included in the tag list) will contain the customised message rather than the line in the log file which was captured. Note that "message=" has to be the final attribute on the "alert_all:" line.

We may also want to extract a particular item from a log file string match - for example a numerical value. This can be done using the capture directive instead of match. The capture directive indicates which part of the log file line needs to be retained; that part is specified within parentheses.

The following example shows a numerical value being extracted from an incoming log file line:

file: logfile.log

capture: Type: (\d+)

alert_all: tags=THE_VALUE

... which would output "109" if logfile.log contained the line "Type: 109".

This feature can also be used to remove unwanted parts of log file lines, e.g. time-stamps. The line:

Sun Jan 03 20:53:54 GMT 2016 [Thread-2]: Warning: Restart in progress

... could be captured as simply "Warning: Restart in progress" by using:

capture: Thread.\d+.: (.+)
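The two capture patterns shown so far can be tried out with Python's re module, whose syntax agrees with these particular expressions. This is a sketch for experimenting with patterns before putting them in a configuration file; the helper name capture is hypothetical and the Uhoh Client's own matching code may differ in detail.

```python
import re

def capture(pattern, line):
    """Return the first parenthesised group if the pattern is found, else None."""
    found = re.search(pattern, line)
    return found.group(1) if found else None

if __name__ == "__main__":
    line = "Sun Jan 03 20:53:54 GMT 2016 [Thread-2]: Warning: Restart in progress"
    print(capture(r"Thread.\d+.: (.+)", line))   # Warning: Restart in progress
    print(capture(r"Type: (\d+)", "Type: 109"))  # 109
```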

This feature is also useful for de-duplicating alerts that are tagged with RED, AMBER or GREEN to appear in the Fault Management View: alerts that differ only in their time-stamps become identical once the time-stamp has been removed.

Where an alert is of the form:

<string>: <numeric value>

... and one of the alert's tags starts with "METRIC_", the Server will write the captured numeric value to a file in the Server's root folder called "metrics/YYYY-MM-DD/<metric_name>" - where:

<metric_name> is the part of the tag following "METRIC_".

For example:

file: logfile.log

capture: Type: (\d+)

alert_all: tags=METRIC_A_VALUE

... will append a line to "metrics/YYYY-MM-DD/A_VALUE" each time "Type: <value>" is encountered in logfile.log. A graph of the metric's values can then be viewed using the Performance Management View.
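The naming rule for metric files can be sketched in Python as follows. This only illustrates how a tag maps to a file path; the function name metric_path is hypothetical, and using the first METRIC_ tag when several are present is an assumption.

```python
from datetime import date

METRIC_PREFIX = "METRIC_"

def metric_path(tags, day):
    """Map a comma-separated tag list to 'metrics/YYYY-MM-DD/<metric_name>'.

    Returns None when no tag starts with METRIC_.
    Assumption: if several METRIC_ tags are present, the first one is used.
    """
    for tag in tags.split(","):
        if tag.startswith(METRIC_PREFIX):
            return f"metrics/{day.isoformat()}/{tag[len(METRIC_PREFIX):]}"
    return None

if __name__ == "__main__":
    print(metric_path("METRIC_A_VALUE", date(2016, 1, 3)))  # metrics/2016-01-03/A_VALUE
```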

Here's the same example again, but this time the #hostname# variable is used to ensure that the metric is written to a file unique to a particular host:

file: logfile.log

capture: Type: (\d+)

alert_all: tags=METRIC_A_#hostname#

Use of #hostname# is essential where multiple hosts share the same Uhoh Client configuration, but we require unique sets of metrics to be collected from each host.

Counting Messages in Log Files

The alert_count directive triggers a periodic alert which indicates the number of times a particular regular expression has been matched in a log file over a set period of time.

For example, the following:

file: logfile.log

match: Type: \d+

alert_count: tags=RED seconds=60

... triggers an alert every minute which contains the number of times the string "Type: <number>" appears in logfile.log. The "RED" tag means that this alert will be displayed as a red (highest priority) alarm in the Fault Management View.

If an alert_count tag starts with METRIC_, then the Server will log the value contained in the alert to "metrics/YYYY-MM-DD/<metric_name>".

It is also possible to perform basic mathematical operations on values captured from log files using the alert_total, alert_minimum, alert_maximum and alert_average directives. This feature is useful for charting, for example, periodic minimum, maximum and average handling times for a web service captured from an Apache HTTPD access log file. These all require a "capture" directive to be used and will output a value after the period of time given by the seconds=<n> directive.

Here are some examples:

file: logfile.log

capture: Type: (\d+)

alert_total: tags=METRIC_TOTAL seconds=60

file: logfile.log

capture: Type: (\d+)

alert_minimum: tags=METRIC_MINIMUM seconds=60

file: logfile.log

capture: Type: (\d+)

alert_maximum: tags=METRIC_MAXIMUM seconds=60

file: logfile.log

capture: Type: (\d+)

alert_average: tags=METRIC_AVERAGE seconds=60

Note that no alert is triggered for average, minimum or maximum if nothing has been captured from the log file during the specified interval.
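As a rough picture of what the four arithmetic directives report, the Python sketch below summarises the values captured from the lines seen during one interval. The function name summarise is hypothetical. Reporting a total of zero for an empty interval is an inference from the note above, which only says that average, minimum and maximum are suppressed.

```python
import re

def summarise(lines, pattern):
    """Aggregate the values captured from the lines seen in one interval."""
    values = []
    for line in lines:
        found = re.search(pattern, line)
        if found:
            values.append(float(found.group(1)))
    if not values:
        # Nothing captured: no minimum, maximum or average is produced.
        # Assumption: the total is still reported, as zero.
        return {"total": 0.0}
    return {
        "total": sum(values),
        "minimum": min(values),
        "maximum": max(values),
        "average": sum(values) / len(values),
    }

if __name__ == "__main__":
    interval = ["Type: 4", "unrelated line", "Type: 10", "Type: 1"]
    print(summarise(interval, r"Type: (\d+)"))
```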

The alert_total and alert_average directives can also be used for threshold alerting by specifying a minimum or maximum value plus alert message, for example:

file: logfile.log

capture: Type: (\d+)

alert_total: tags=ALERT_TOTAL seconds=60 minimum=10 message=Total is below ten

file: logfile.log

capture: Type: (\d+)

alert_total: tags=ALERT_TOTAL seconds=60 maximum=10 message=Total is above ten

file: logfile.log

capture: Type: (\d+)

alert_total: tags=ALERT_TOTAL seconds=60 minimum=5 maximum=10 message=Total is outside of range 5-10

Or:

file: logfile.log

capture: Type: (\d+)

alert_average: tags=ALERT_AVERAGE seconds=60 minimum=10 message=Average is below ten

file: logfile.log

capture: Type: (\d+)

alert_average: tags=ALERT_AVERAGE seconds=60 maximum=10 message=Average is above ten

file: logfile.log

capture: Type: (\d+)

alert_average: tags=ALERT_AVERAGE seconds=60 minimum=5 maximum=10 message=Average is outside of range 5-10

The above are extremely useful for raising alerts based on component throughput and average handling times exceeding design thresholds.

The alert_range directive is used to trigger a periodic alert if the number of matches in a log file falls outside a set range during that time period. For example, the following raises an alert if there are fewer than 3 or more than 6 instances of the string "Type X" appearing in logfile.log within a minute:

file: logfile.log

match: Type X

alert_range: tags=AMBER,X_RANGE seconds=60 minimum=3 maximum=6

Directives minimum= and maximum= can be specified on their own. For example, the following raises an alert if the file logfile.log hasn't been updated at all over the last minute (useful for detecting whether a program has entered a "hung" state), with the check active only between 07:00 and 22:00 on any day of the week:

file: logfile.log

match: .+

active: 1234567;07:00-22:00

alert_range: tags=INACTIVE seconds=60 minimum=1

(Note that .+ matches one or more of any character, i.e. any non-empty line in the log file.)

The following example triggers an alert if more than 20 lines containing "Exception" appear in logfile.log within a five minute period:

file: logfile.log

match: Exception

alert_range: tags=RED,EXCEPTION seconds=300 maximum=20

Directive alert_range, just like alert_all, can be provided with a customised message to include in the alert instead of the match count. For example:

alert_range: tags=RED,EXCEPTION seconds=300 maximum=20 message=Out of bounds
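The range check itself can be sketched in a few lines of Python. This is a hedged illustration of the rule described in this section (count the matching lines in one interval, then compare against whichever limits were given); the function name range_alert is hypothetical.

```python
import re

def range_alert(lines, pattern, minimum=None, maximum=None):
    """Return the match count if it is outside the limits, else None.

    `lines` stands for the log lines seen during one seconds=<n> interval.
    Either limit may be omitted, as with the alert_range directive.
    """
    count = sum(1 for line in lines if re.search(pattern, line))
    too_few = minimum is not None and count < minimum
    too_many = maximum is not None and count > maximum
    return count if (too_few or too_many) else None

if __name__ == "__main__":
    # An empty interval with minimum=1 mimics the "hung program" check.
    print(range_alert([], r".+", minimum=1))
```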

Basic Host Infrastructure Monitoring

The next set of configuration directives is used to examine disk usage and running processes:

alert_disk - monitors disk free space.
alert_process - monitors running processes.

Directive alert_disk is used to check available space on a disk.

The following raises an alert if utilised space on the root filesystem exceeds 80%:

file: /

alert_disk: tags=ROOT_FS,RED maximum=80

Multiple alert_disk directives can be used together, so if the following is combined with the above:

file: /

alert_disk: tags=ROOT_FS,AMBER maximum=70

... then an Amber alert (in the Fault Management View) will appear if root filesystem space exceeds 70%, and then a Red alert will appear if space exceeds 80%.
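The way two thresholds stack can be illustrated with the Python sketch below. It is not the Uhoh Client's implementation: the function names are hypothetical, and computing utilisation as used space divided by total space is an assumption about how the percentage is derived.

```python
import shutil

def percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    # Assumption: utilisation is used / total.
    return 100.0 * usage.used / usage.total

def disk_alerts(used_percent, thresholds):
    """Return the tag lists of every threshold that has been exceeded.

    `thresholds` is a list of (tags, maximum_percent) pairs, one per
    alert_disk directive.
    """
    return [tags for tags, maximum in thresholds if used_percent > maximum]

if __name__ == "__main__":
    rules = [("ROOT_FS,RED", 80), ("ROOT_FS,AMBER", 70)]
    print(disk_alerts(percent_used("/"), rules))
```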

To monitor for the presence (or non-presence) of processes, use the alert_process directive.

Firstly, we need to set the command used to fetch the process table. This is done using the ps_command directive. For example, on Linux we could set this as follows:

ps_command: ps -fe

On Windows, we would use:

ps_command: tasklist.exe

Then, we define pattern matches and alert_process statements. For example, to raise an alert if no Apache httpd processes are running or if more than 20 such processes are running:

match: httpd

alert_process: tags=AMBER,APACHE minimum=1 maximum=20

Note that we have to specify both minimum and maximum, unlike with alert_range. If we would like an alert to be triggered when anything other than a set number of processes is running, then we specify exactly instead of minimum and maximum.

For example, if only one instance of "chart_load" should be running at any time:

match: chart_load

alert_process: tags=CHARTING exactly=1
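The process check can be pictured as counting the lines of the process table that match the pattern and comparing the count with the limits. The Python sketch below works on a captured process table string so that it can be run anywhere; the function name process_alert and the sample table are made up for illustration.

```python
import re

def process_alert(ps_output, pattern, minimum=None, maximum=None, exactly=None):
    """Return True when the number of matching processes breaks the rule."""
    count = sum(1 for line in ps_output.splitlines() if re.search(pattern, line))
    if exactly is not None:
        return count != exactly
    too_few = minimum is not None and count < minimum
    too_many = maximum is not None and count > maximum
    return too_few or too_many

# A made-up fragment of "ps -fe" output, used only for this example.
SAMPLE_PS = """\
root       101     1  0 10:00 ?  00:00:00 /usr/sbin/httpd -DFOREGROUND
apache     102   101  0 10:00 ?  00:00:00 /usr/sbin/httpd -DFOREGROUND
charts     200     1  0 10:01 ?  00:00:03 /opt/bin/chart_load
"""

if __name__ == "__main__":
    print(process_alert(SAMPLE_PS, "httpd", minimum=1, maximum=20))
    print(process_alert(SAMPLE_PS, "chart_load", exactly=1))
```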

Running Commands on a Host

The Uhoh Client can also run periodic commands and raise alerts if any lines in the command output match a regular expression. This is generally used for custom monitoring by running a script and is achieved using the alert_cmd directive.

The following runs "df -i ." every 150 seconds and captures any lines which contain a number, followed by one or more spaces, then another number, then another one or more spaces. The active directive means that the command will only be run from 17:00 - 19:00 on Saturday and Sunday.

match: \d+\s+\d+\s+

active: 17;17:00-19:00

alert_cmd: tags=CMD,GREEN seconds=150 command=df -i .

If an alert_cmd tag starts with METRIC_, and the command output is of the form:

<string>: <value>

... then the Uhoh Server will log the value contained in the alert to the file <date>/<metric_name> within the metrics folder.

<date> has the format: YYYY-MM-DD.

The alert_cmd directive can also be used with capture rather than match, for example:

capture: DISK2\s+\d+\s+(\d+)

alert_cmd: tags=CMD,GREEN seconds=150 command=df -i .

The above will find lines that are of the form:

DISK2 <number1> <number2>

... and return the value of <number2> only.

The alert_cmd directive can also be provided with threshold values which apply when used in conjunction with the capture directive. For example:

capture: DISK2\s+\d+\s+(\d+)

alert_cmd: tags=CMD,GREEN seconds=150 maximum=10 command=df -i .

... will raise a second alert (in addition to the alert containing the actual value of the metric), containing the text Exceeded Upper Bound if the value captured exceeds 10. Both the metric and threshold alerts will be tagged with the same tags - CMD,GREEN in this case. Likewise:

capture: DISK2\s+\d+\s+(\d+)

alert_cmd: tags=CMD,GREEN seconds=150 minimum=5 command=df -i .

... will raise a second alert, containing the text Below Lower Bound if the value captured is below 5.

Both the maximum and minimum parameters can be used in the same alert_cmd declaration:

capture: DISK2\s+\d+\s+(\d+)

alert_cmd: tags=CMD,GREEN seconds=150 minimum=5 maximum=10 command=df -i .

So that a different set of alert tags can be used for the metric logging and metric threshold events, alternative tags can be specified using the threshold_tags parameter to alert_cmd, as follows:

capture: DISK2\s+\d+\s+(\d+)

alert_cmd: tags=CMD,METRIC_LEVEL_#hostname# seconds=150 minimum=5 threshold_tags=RED command=df -i .

The above configuration will result in:

The metric taken from the output of the command being logged using tags CMD,METRIC_LEVEL_#hostname# - where #hostname# is replaced by the name of the host that the Uhoh Client is running on.
The threshold breach alert being logged using the tag RED.

If threshold_tags isn't specified, then both alerts use the tags specified by the tags parameter.
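The interplay of tags, thresholds and threshold_tags can be sketched in Python as below. It models the behaviour described in this section on a captured command output string; the function name evaluate_cmd, the sample output and the host name web01 are hypothetical.

```python
import re

def evaluate_cmd(output, pattern, tags, minimum=None, maximum=None,
                 threshold_tags=None):
    """Return the (tags, message) alerts produced by one command run."""
    # Threshold alerts fall back to the metric tags when threshold_tags
    # is not given.
    breach_tags = threshold_tags or tags
    alerts = []
    for line in output.splitlines():
        found = re.search(pattern, line)
        if not found:
            continue
        value = float(found.group(1))
        alerts.append((tags, found.group(1)))          # the metric itself
        if maximum is not None and value > maximum:
            alerts.append((breach_tags, "Exceeded Upper Bound"))
        if minimum is not None and value < minimum:
            alerts.append((breach_tags, "Below Lower Bound"))
    return alerts

if __name__ == "__main__":
    sample = "DISK1 10 20\nDISK2 100 3\n"   # made-up command output
    for alert in evaluate_cmd(sample, r"DISK2\s+\d+\s+(\d+)",
                              "CMD,METRIC_LEVEL_web01", minimum=5,
                              threshold_tags="RED"):
        print(alert)
```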

Network Monitoring and Alert Correlation

The final set of directives available for use with an Uhoh Client are:

alert_tcp - used to check whether a TCP/IP socket server (for example a web server) is accepting connections.
alert_multi - correlates a set of alerts over a specified period of time to generate further alerts.
alert_rest - allows incoming web-hooks to raise alerts.

In order to check whether a TCP server is ready to accept incoming connections (for example, a web server), use the alert_tcp directive.

The following Uhoh Client configuration line runs a check every 60 seconds to determine whether port 80 at 127.0.0.1 is accepting connections:

alert_tcp: tags=WEB_SRV,RED seconds=60 ip=127.0.0.1 port=80 timeout=2 message=Apache HTTPD is not running

The timeout= parameter specifies that the Uhoh Client will only wait for two seconds before giving up. The message= parameter contains the message to log if the connection test fails and must be the last parameter specified on the alert_tcp line.
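The kind of check that alert_tcp performs can be reproduced by hand with a short Python sketch, which may help when deciding on a sensible timeout value. This is an illustration using the standard socket module, not the Uhoh Client's code; the function name tcp_accepting is hypothetical.

```python
import socket

def tcp_accepting(ip, port, timeout):
    """Return True if a TCP connection to ip:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

if __name__ == "__main__":
    if not tcp_accepting("127.0.0.1", 80, timeout=2):
        print("Apache HTTPD is not running")
```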

Correlation of multiple alerts to generate further alerts is achieved using the alert_multi directive. In this way, an alert will be triggered if an Uhoh Client detects a series of alerts containing specific tags in a watchlist within a set period of time. Here's an example:

file: test.log

match: type_1

alert_all: tags=TAG1,INFO

file: test.log

match: type_2

alert_all: tags=TAG2,INFO

alert_multi: tags=RED,DERIVED seconds=60 collect=TAG1,TAG2 message=Both types found

Here we've set up two file consumers (alert_all) which detect log lines containing "type_1" and "type_2". Each consumer triggers an alert containing either TAG1 or TAG2. The alert_multi line uses its collect attribute to specify that if other alerts containing the tags TAG1 and TAG2 have been raised in the last 60 seconds, then an alert tagged with RED,DERIVED and the message text "Both types found" will be raised.
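The correlation rule can be sketched in Python as a check that every tag in the collect list has been seen inside the time window. This is only a model of the described behaviour; the function name correlate and the use of plain numbers as timestamps are assumptions made for the example.

```python
def correlate(alerts, collect, seconds, now):
    """Return True when every tag in `collect` was seen in the last `seconds`.

    `alerts` is an iterable of (timestamp, "TAG,TAG,...") pairs, with
    timestamps and `now` expressed in seconds.
    """
    wanted = set(collect.split(","))
    seen = set()
    for timestamp, tags in alerts:
        if now - seconds <= timestamp <= now:
            seen.update(wanted.intersection(tags.split(",")))
    return seen == wanted

if __name__ == "__main__":
    history = [(100, "TAG1,INFO"), (130, "TAG2,INFO")]
    if correlate(history, "TAG1,TAG2", seconds=60, now=150):
        print("RED,DERIVED: Both types found")
```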

As the correlation using alert_multi takes place within a single Uhoh Client instance, a slightly modified configuration pattern is required if correlation of alerts needs to be performed across multiple Uhoh Clients. The pattern runs an Uhoh Client on the same host as the Uhoh Server and this Client is used to correlate the alerts using the Uhoh Server's Log Stream as a source. Here's how this would be achieved:

The Uhoh Client configuration file for Host_A would specify:

file: test.log

match: type_1

alert_all: tags=MULTITAG1

The Uhoh Client configuration file for Host_B would specify:

file: test.log

match: type_1

alert_all: tags=MULTITAG2

The Uhoh Client configuration file for the host running the Uhoh Server would contain the following in order to capture MULTITAG1 and MULTITAG2 from the Log Stream, re-tag them as MCRR1 and MCRR2 and then correlate using tags MCRR1 and MCRR2:

file: server.log

match: MULTITAG1

alert_all: tags=MCRR1

file: server.log

match: MULTITAG2

alert_all: tags=MCRR2

alert_multi: tags=RED,DERIVED seconds=60 collect=MCRR1,MCRR2 message=Both types found

Choose the tags used for correlation carefully so as to avoid "loops" where a derived alert then re-triggers the same correlation.

It's also possible to configure a Client to receive alert notifications as incoming REST web hooks. To set this up, use the alert_rest directive:

alert_rest: port=<tcp_port_number>

For example:

alert_rest: port=5656

REST requests use a URL formed as follows:

/alert/<TAGS>/<MESSAGE>

For example, using the curl command:

curl 'http://192.168.1.20:5656/alert/RED,MSGQ/Inbound_queue_server:%20Queue%20is%20full'

Note that '%20' is used to represent spaces in the message text and that the alert hostname will be the name of the host on which the recipient Uhoh Client is running.
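When the message text is generated by a program, the percent-encoding is best left to a library. The Python sketch below builds a web-hook URL in the /alert/<TAGS>/<MESSAGE> form using the standard urllib.parse.quote function. The function name alert_url is hypothetical, and leaving commas and colons unencoded is a choice made here for readability rather than a documented requirement.

```python
from urllib.parse import quote

def alert_url(host, port, tags, message):
    """Build an alert_rest URL, percent-encoding spaces as %20."""
    encoded_tags = quote(tags, safe=",")        # keep the tag separator readable
    encoded_message = quote(message, safe=":")  # spaces become %20
    return f"http://{host}:{port}/alert/{encoded_tags}/{encoded_message}"

if __name__ == "__main__":
    print(alert_url("192.168.1.20", 5656, "RED,MSGQ",
                    "Inbound_queue_server: Queue is full"))
```

The resulting URL can then be requested with any HTTP client, for example urllib.request.urlopen.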

Instrumenting Your Applications

You may prefer to incorporate instrumentation into an application in order to feed alerts and metrics into Uhoh rather than have Uhoh read an application's log file. The simplest way to do this is for the application to send a UDP message to the Uhoh Client containing a string in the following format:

INJECT%%<TAGS>%%<MESSAGE>

For example, a shell-script could send an alert to Uhoh using the netcat (nc) command-line utility:

echo "INJECT%%RED,TEST_QUEUE%%This is a test alert" | nc -4u -w0 127.0.0.1 8888

Using UDP means that the application's instrumentation isn't tightly coupled to the Uhoh Client; tight coupling could affect the performance of the application.
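An application can send the same kind of message without shelling out to netcat. The Python sketch below is a minimal example using the standard socket module. The function name inject_alert is hypothetical, and the UTF-8 encoding and trailing newline (which mirrors what echo sends) are assumptions rather than documented requirements.

```python
import socket

def inject_alert(tags, message, host="127.0.0.1", port=8888):
    """Send an INJECT%%<TAGS>%%<MESSAGE> datagram and return the bytes sent."""
    # Assumption: UTF-8 text with a trailing newline, as "echo ... | nc" would send.
    payload = f"INJECT%%{tags}%%{message}\n".encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Fire and forget: UDP gives no delivery confirmation.
        sock.sendto(payload, (host, port))
    return payload

if __name__ == "__main__":
    inject_alert("RED,TEST_QUEUE", "This is a test alert")
```

Because the datagram is sent without waiting for a reply, the call returns immediately even if no Uhoh Client is listening.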

Prompting an Uhoh Client to Reload its Configuration

After an Uhoh Client configuration file has been edited on the Uhoh Server, the Uhoh Client which uses this configuration file will need to re-load the configuration in order to pick up the changes. There are two ways that this can be done:

Re-start the Uhoh Client in question.
Prompt the Uhoh Server to inform the Uhoh Client that its configuration has changed.

For the second option, above, create a file in the Uhoh Server's installation folder called .reset which contains the host name for the Uhoh Client which needs re-configuring. The Uhoh Server, on finding the .reset file, will send a reset message to any Uhoh Client host listed in the file. (The .reset file can contain multiple host names.) The Uhoh Client will then reset itself to remove all monitoring configuration and return itself to its start-up state. It will then request configuration from the Uhoh Server and start monitoring as normal. Note that the Uhoh Client doesn't actually re-start itself during this procedure.

Removing an Uhoh Client from the Uhoh Server's Watchlist

If an Uhoh Client is closed down, for example if the host it is running on is to be decommissioned, then the Uhoh Server will generate alerts indicating that the Uhoh Client in question is no longer running until the time specified via the client_remove_time Uhoh Server configuration parameter. However, if you need to remove the hostname of an Uhoh Client from an Uhoh Server's watchlist before client_remove_time occurs, create a file called .forget in the Uhoh Server's installation folder containing the hostname of the Uhoh Client to be removed from the watchlist.

Note that this action will need to be performed on all Uhoh Servers within a resilient set.
