To run an Uhoh Client on a host, only the uhoh.jar file from the Uhoh distribution is required to be present on that host. Copy this file to a folder on the host to be monitored and run the following command to start the Uhoh Client:
java -cp uhoh.jar com.uhoh.Client 8888
Port 8888 is the default Uhoh Server to Uhoh Client UDP broadcast port. If you have configured your Uhoh Server to use an alternative port, specify that port instead.
The Uhoh Client will listen for UDP broadcast advert messages from the Uhoh Server and on receiving such a message will request a configuration. Uhoh Clients, by default, request a configuration with the name of the host that the Uhoh Client is running on.
If you are running Uhoh in an environment that doesn't support UDP broadcast (for example Amazon Web Services), then you will need to specify the IP address or addresses of Uhoh Servers when starting an Uhoh Client using the servers parameter:
java -cp uhoh.jar com.uhoh.Client 8888 servers=10.1.1.1,10.1.1.2
The Uhoh Servers are specified as a comma-separated list of IP addresses or DNS names. The Uhoh Client will poll each Uhoh Server in the list until one of them responds with a configuration file.
It is also possible for the Uhoh Client to specify exactly which configuration file it should use from the Uhoh Server's 'clientconfigs' library rather than requesting a configuration file with the same name as the host that the Uhoh Client is running on. To do this, use the optional type parameter when starting the Uhoh Client:
java -cp uhoh.jar com.uhoh.Client 8888 type=webserver
The 'type' parameter would be most commonly used in highly dynamic environments where virtual hosts are started frequently and the Uhoh Clients running on these hosts can inform the Uhoh Server of the type of host that has been started, rather than relying on hostname (which may be dynamically assigned).
Each host to be monitored by Uhoh requires an Uhoh Client Configuration File. This file is placed within the clientconfigs folder on the Uhoh Server host. The file's name is the name of the host to be monitored. When an Uhoh Client is started on a host, the Uhoh Client will request a monitoring configuration file named with that host's name from the Uhoh Server.
The Uhoh Client Configuration File for each host contains sets of directives. These directives specify the types of monitoring operations that the Uhoh Client should perform. All directive types are described below.
Alerts sent from Uhoh Clients to Uhoh Servers are labelled with a comma-separated list of tags. These tags are to be used by programs which consume the server's Log Stream and are used to differentiate between alert types. There are also three special tags - RED, AMBER and GREEN. Alerts tagged with these values will appear in the Fault Management View provided by the Uhoh Server's built-in web server as well as the Log Stream.
If the alert message or alert tags contain the string #hostname#, then this string will be replaced by the name of the host that the Uhoh Client is running on. This feature is useful for tagging alerts as originating from a specific host.
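The substitution described above can be pictured with a short Python sketch. It is an illustration only, not Uhoh's own code: the function name expand_hostname is hypothetical, and socket.gethostname() is assumed to stand in for however the Uhoh Client determines its host name.

```python
import socket

PLACEHOLDER = "#hostname#"


def expand_hostname(text, hostname=None):
    """Replace every #hostname# placeholder in text with a host name.

    If no host name is supplied, the one reported by the operating system
    is used (an assumption; the Uhoh Client may determine it differently).
    """
    if hostname is None:
        hostname = socket.gethostname()
    return text.replace(PLACEHOLDER, hostname)
```

For example, expand_hostname("METRIC_A_#hostname#", "web01") gives "METRIC_A_web01", so two hosts sharing one configuration file would produce differently named tags.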
The following directives are used to initiate collection of textual information from log files:
The alert_all directive configures the Uhoh Client to consume a log file and trigger an alert each time a line matching a particular regular expression is found. This directive is used as follows:
file: <log_file_name>
match: <regular_expression>
alert_all: tags=<tags>
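You can also add an optional active directive which describes when this check will be active. The parameter supplied to the active directive is a comma-separated list of <days>;<hour/min-hour/min> items, where <days> is a string of day-of-week digits. Judging by the example below, 1 denotes Sunday and 7 denotes Saturday.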
The active directive can be used with any alert directive type.
Below is an example of alert_all where lines containing "Type: <number>" are found. The consumer is only active from 09:00 - 17:30 Monday - Friday and 10:00 - 16:00 Saturday - Sunday.
file: logfile.log
active: 23456;09:00-17:30,17;10:00-16:00
match: Type: \d+
alert_all: tags=RED
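To make the format of the active parameter concrete, here is a minimal Python sketch of how such a value could be interpreted. It is not taken from the Uhoh source. The function names are hypothetical, the day numbering (1 = Sunday to 7 = Saturday) is inferred from the example above, and treating both ends of each time window as inclusive is an assumption.

```python
from datetime import datetime


def parse_active(spec):
    """Parse comma-separated '<days>;<HH:MM-HH:MM>' items.

    Returns a list of (day_numbers, start_time, end_time) tuples.
    """
    windows = []
    for item in spec.split(","):
        days, hours = item.split(";")
        start, end = hours.split("-")
        windows.append((
            {int(digit) for digit in days},
            datetime.strptime(start, "%H:%M").time(),
            datetime.strptime(end, "%H:%M").time(),
        ))
    return windows


def day_number(moment):
    """Map a datetime to 1 (Sunday) .. 7 (Saturday); an assumed numbering."""
    # datetime.weekday() returns 0 for Monday .. 6 for Sunday.
    return (moment.weekday() + 1) % 7 + 1


def is_active(spec, moment):
    """Return True if moment falls inside any window of the active value."""
    for days, start, end in parse_active(spec):
        if day_number(moment) in days and start <= moment.time() <= end:
            return True
    return False
```

With the value "23456;09:00-17:30,17;10:00-16:00", a Monday at 10:00 is inside the first window, while a Monday at 18:00 is outside both.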
The alert_all line can also trigger a customised message, as follows:
alert_all: tags=RED message=Type identified
In this case, the alert recorded (and sent to the Fault Management View if a RED, AMBER or GREEN tag is included in the tag list) will contain the customised message rather than the line in the log file which was captured. Note that "message=" has to be the final attribute on the "alert_all:" line.
We may also want to extract a particular item from a matching log file line, for example a numerical value. This can be done by using the capture directive instead of match. The part of the line to be retained is specified within parentheses in the regular expression.
The following example shows a numerical value being extracted from an incoming log file line:
file: logfile.log
capture: Type: (\d+)
alert_all: tags=THE_VALUE
... which would output "109" if logfile.log contained the line "Type: 109".
This feature can also be used to remove unwanted parts of log file lines, e.g. time-stamps. The line:
Sun Jan 03 20:53:54 GMT 2016 [Thread-2]: Warning: Restart in progress
... could be captured as simply "Warning: Restart in progress" by using:
capture: Thread.\d+.: (.+)
This feature is also useful for de-duplicating alerts that differ only in their time-stamps when they are tagged with RED, AMBER or GREEN and so appear in the Fault Management View.
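The two capture expressions shown above can be tried out before being placed in a configuration file. The sketch below uses Python's re module purely as a test bench. The Uhoh Client is started with java, so its regular expression engine may differ in detail, and the assumption that a pattern may match anywhere within a line (rather than the whole line) is inferred from the time-stamp example. The helper name capture is hypothetical.

```python
import re


def capture(pattern, line):
    """Return the first parenthesised group if pattern is found in line.

    Returns None when the line does not match at all.
    """
    found = re.search(pattern, line)
    return found.group(1) if found else None
```

Calling capture(r"Type: (\d+)", "Type: 109") yields "109", and applying the Thread expression to the time-stamped line above yields only the text after the thread name.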
Where an alert is of the form:
<string>: <numeric value>
... and one of the tags for the alert starts with "METRIC_", the Server will write the numeric value captured to a file in the Server's root folder called "metrics/YYYY-MM-DD/<metric_name>", where YYYY-MM-DD is the date and <metric_name> is the tag with the "METRIC_" prefix removed.
For example:
file: logfile.log
capture: Type: (\d+)
alert_all: tags=METRIC_A_VALUE
... will append a line to "metrics/YYYY-MM-DD/A_VALUE" each time "Type: <value>" is encountered in logfile.log. A graph of the metric's values can then be viewed using the Performance Management View.
Here's the same example again, but this time the #hostname# variable is used to ensure that the metric is written to a file unique to a particular host:
file: logfile.log
capture: Type: (\d+)
alert_all: tags=METRIC_A_#hostname#
Use of #hostname# is essential where multiple hosts share the same Uhoh Client configuration, but we require unique sets of metrics to be collected from each host.
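The relationship between a METRIC_ tag and the file the Server writes to can be sketched as follows. This is a guess at the naming rule based on the examples above, not the Server's actual code. The function name metric_path is hypothetical, and the sketch deliberately says nothing about the format of the lines written inside the file.

```python
from datetime import date

METRIC_PREFIX = "METRIC_"


def metric_path(tags, hostname, day):
    """Return the relative metrics file path for a tag list, or None.

    tags is the comma-separated tag list from the configuration,
    hostname replaces any #hostname# placeholder, and day is the date
    used for the YYYY-MM-DD folder.
    """
    for tag in tags.replace("#hostname#", hostname).split(","):
        if tag.startswith(METRIC_PREFIX):
            name = tag[len(METRIC_PREFIX):]
            return "metrics/" + day.isoformat() + "/" + name
    return None  # no METRIC_ tag, so nothing would be logged as a metric
```

With tags=METRIC_A_#hostname#, hosts web01 and web02 would map to separate files A_web01 and A_web02 under the same dated folder.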
The alert_count directive triggers a periodic alert which indicates the number of times a particular regular expression has been matched in a log file over a set period of time.
For example, the following:
file: logfile.log
match: Type: \d+
alert_count: tags=RED seconds=60
... triggers an alert every minute which contains the number of times the string "Type: <number>" appears in logfile.log. The "RED" tag means that this alert will be displayed as a red (highest priority) alarm in the Fault Management View.
If an alert_count tag starts with METRIC_, then the Server will log the value contained in the alert to "metrics/YYYY-MM-DD/<metric_name>".
It is also possible to perform basic mathematical operations on values captured from log files using the alert_total, alert_minimum, alert_maximum and alert_average directives. This feature is useful for charting, for example, periodic minimum, maximum and average handling times for a web service captured from an Apache HTTPD access log file. Each of these directives requires a capture directive and outputs a value at the end of each period given by the seconds=<n> parameter.
Here are some examples:
file: logfile.log
capture: Type: (\d+)
alert_total: tags=METRIC_TOTAL seconds=60
file: logfile.log
capture: Type: (\d+)
alert_minimum: tags=METRIC_MINIMUM seconds=60
file: logfile.log
capture: Type: (\d+)
alert_maximum: tags=METRIC_MAXIMUM seconds=60
file: logfile.log
capture: Type: (\d+)
alert_average: tags=METRIC_AVERAGE seconds=60
Note that no alert is triggered for average, minimum or maximum if nothing has been captured from the log file during the specified interval.
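The behaviour of the four directives over one interval can be summarised in a few lines of Python. This is a sketch of the arithmetic only, with a hypothetical function name. Reporting a total of 0 for an empty interval is an assumption, since the text above only states that average, minimum and maximum stay silent when nothing was captured.

```python
def summarise(directive, values):
    """Summarise the numbers captured during one interval.

    Returns the value to report, or None when no alert would be raised.
    """
    if directive == "alert_total":
        # Assumption: an empty interval is reported as a total of 0.
        return sum(values)
    if not values:
        # Nothing captured: no alert for minimum, maximum or average.
        return None
    if directive == "alert_minimum":
        return min(values)
    if directive == "alert_maximum":
        return max(values)
    if directive == "alert_average":
        return sum(values) / len(values)
    raise ValueError("unknown directive: " + directive)
```

For handling times of 120, 80 and 100 captured within one 60 second period, the four directives would report 300, 80, 120 and 100 respectively.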
The alert_total and alert_average directives can also be used for threshold alerting by specifying a minimum or maximum value plus alert message, for example:
file: logfile.log
capture: Type: (\d+)
alert_total: tags=ALERT_TOTAL seconds=60 minimum=10 message=Total is below ten
file: logfile.log
capture: Type: (\d+)
alert_total: tags=ALERT_TOTAL seconds=60 maximum=10 message=Total is above ten
file: logfile.log
capture: Type: (\d+)
alert_total: tags=ALERT_TOTAL seconds=60 minimum=5 maximum=10 message=Total is outside of range 5-10
Or:
file: logfile.log
capture: Type: (\d+)
alert_average: tags=ALERT_AVERAGE seconds=60 minimum=10 message=Average is below ten
file: logfile.log
capture: Type: (\d+)
alert_average: tags=ALERT_AVERAGE seconds=60 maximum=10 message=Average is above ten
file: logfile.log
capture: Type: (\d+)
alert_average: tags=ALERT_AVERAGE seconds=60 minimum=5 maximum=10 message=Average is outside of range 5-10
The above are extremely useful for raising alerts based on component throughput and average handling times exceeding design thresholds.
The alert_range directive is used to trigger a periodic alert if the number of matches in a log file falls outside a set range during that time period. For example, the following raises an alert if there are fewer than 3 or more than 6 instances of the string "Type X" appearing in logfile.log within a minute:
file: logfile.log
match: Type X
alert_range: tags=AMBER,X_RANGE seconds=60 minimum=3 maximum=6
The minimum= and maximum= parameters can each be specified on their own. For example, the following raises an alert if the file logfile.log hasn't been updated at all over the last minute (useful for detecting that a program has entered a "hung" state), with the check only active between 07:00 and 22:00 on any day of the week:
file: logfile.log
match: .+
active: 1234567;07:00-22:00
alert_range: tags=INACTIVE seconds=60 minimum=1
(Note that .+ matches one or more of any character, i.e. any non-empty line in the log file.)
The following example triggers an alert if more than 20 lines containing "Exception" appear in logfile.log within a five minute period:
file: logfile.log
match: Exception
alert_range: tags=RED,EXCEPTION seconds=300 maximum=20
Directive alert_range, just like alert_all, can be provided with a customised message to include in the alert instead of the match count. For example:
alert_range: tags=RED,EXCEPTION seconds=300 maximum=20 message=Out of bounds
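The range test behind alert_range can be sketched in Python as below. The function names are hypothetical and the code is only an illustration of the rule described above: a bound that is not given is ignored, and a count equal to a bound is treated as inside the range, matching the "fewer than 3 or more than 6" wording.

```python
import re


def count_matches(pattern, lines):
    """Count the lines in which the regular expression is found."""
    return sum(1 for line in lines if re.search(pattern, line))


def out_of_range(count, minimum=None, maximum=None):
    """Return True when count is below minimum or above maximum.

    Either bound may be omitted, in which case it is not checked.
    """
    if minimum is not None and count < minimum:
        return True
    if maximum is not None and count > maximum:
        return True
    return False
```

With minimum=1 and no maximum, a period in which no lines at all were seen gives out_of_range(0, minimum=1), which is True, so the "hung program" check above would raise its alert.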
The next set of configuration directives are used to examine disk usage and running processes:
Directive alert_disk is used to check the percentage of space used on a disk.
The following raises an alert if utilised space on the root filesystem exceeds 80%:
file: /
alert_disk: tags=ROOT_FS,RED maximum=80
Multiple alert_disk directives can be used together, so if the following is combined with the above:
file: /
alert_disk: tags=ROOT_FS,AMBER maximum=70
... then an Amber alert (in the Fault Management View) will appear if root filesystem space exceeds 70%, and then a Red alert will appear if space exceeds 80%.
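The effect of stacking two alert_disk rules can be illustrated with the Python sketch below. It is not Uhoh's implementation. The names used_percent, disk_alerts and RULES are hypothetical, and shutil.disk_usage may report a slightly different percentage from tools such as df. Whether the Client still emits the Amber alert once the Red threshold has also been passed is not stated above; the sketch simply returns every rule that is exceeded.

```python
import shutil


def used_percent(path):
    """Percentage of the filesystem containing path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def disk_alerts(percent, rules):
    """Return the tag lists of all rules whose maximum is exceeded.

    rules is a list of (tags, maximum) pairs mirroring alert_disk lines.
    """
    return [tags for tags, maximum in rules if percent > maximum]


# The two rules from the examples above.
RULES = [("ROOT_FS,RED", 80), ("ROOT_FS,AMBER", 70)]
```

A reading of 75% exceeds only the Amber rule, whereas 85% exceeds both; disk_alerts(used_percent("/"), RULES) would evaluate the rules against the real root filesystem.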
To monitor for the presence (or non-presence) of processes, use the alert_process directive.
Firstly, we need to set the command used to fetch the process table. This is done using the ps_command directive. For example, on Linux we could set this as follows:
ps_command: ps -fe
On Windows, we would use:
ps_command: tasklist.exe
Then, we define pattern matches and alert_process statements. For example, to raise an alert if no Apache httpd processes are running or if more than 20 such processes are running:
match: httpd
alert_process: tags=AMBER,APACHE minimum=1 maximum=20
Note that we have to specify both minimum and maximum, unlike with alert_range. If we would like an alert to be triggered if anything other than a set number of processes are running, then we specify exactly instead of minimum and maximum.
For example, if only one instance of "chart_load" should be running at any time:
match: chart_load
alert_process: tags=CHARTING exactly=1
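The process check can be pictured as counting the lines of the process table that match the pattern and comparing that count with the configured limits. The Python sketch below does this against a made-up process listing. The function name process_alert and the SAMPLE_PS text are hypothetical, and the real Client may count or filter lines differently.

```python
import re

# Invented output resembling "ps -fe"; not captured from a real host.
SAMPLE_PS = """UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 09:00 ?        00:00:01 /sbin/init
apache     812     1  0 09:01 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
apache     813   812  0 09:01 ?        00:00:00 /usr/sbin/httpd -DFOREGROUND
"""


def process_alert(ps_output, pattern, minimum=None, maximum=None, exactly=None):
    """Return True if the number of matching lines should raise an alert."""
    count = sum(1 for line in ps_output.splitlines() if re.search(pattern, line))
    if exactly is not None:
        return count != exactly
    below = minimum is not None and count < minimum
    above = maximum is not None and count > maximum
    return below or above
```

Against the sample listing, the httpd rule with minimum=1 and maximum=20 stays quiet (two matches), while the chart_load rule with exactly=1 would raise an alert because no matching line exists.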
The Uhoh Client can also run periodic commands and raise alerts if any lines in the command output match a regular expression. This is generally used for custom monitoring by running a script and is achieved using the alert_cmd directive.
The following runs "df -i ." every 150 seconds and captures any lines which contain a number, followed by one or more spaces, then another number, then another one or more spaces. The active directive means that the command will only be run from 17:00 - 19:00 on Saturday and Sunday.
match: \d+\s+\d+\s+
active: 17;17:00-19:00
alert_cmd: tags=CMD,GREEN seconds=150 command=df -i .
If an alert_cmd tag starts with METRIC_, and the command output is of the form:
<string>: <value>
... then the Uhoh Server will log the value contained in the alert to the file <date>/<metric_name> within the metrics folder.
The alert_cmd directive can also be used with capture rather than match, for example:
capture: DISK2\s+\d+\s+(\d+)
alert_cmd: tags=CMD,GREEN seconds=150 command=df -i .
The above will find lines that are of the form:
DISK2 <number1> <number2>
... and return the value of <number2> only.
The alert_cmd directive can also be provided with threshold values which apply when used in conjunction with the capture directive. For example:
capture: DISK2\s+\d+\s+(\d+)
alert_cmd: tags=CMD,GREEN seconds=150 maximum=10 command=df -i .
... will raise a second alert (in addition to the alert containing the actual value of the metric), containing the text Exceeded Upper Bound if the value captured exceeds 10. Both the metric and threshold alerts will be tagged with the same tags - CMD,GREEN in this case. Likewise:
capture: DISK2\s+\d+\s+(\d+)
alert_cmd: tags=CMD,GREEN seconds=150 minimum=5 command=df -i .
... will raise a second alert, containing the text Below Lower Bound if the value captured is below 5.
Both the maximum and minimum parameters can be used in the same alert_cmd declaration:
capture: DISK2\s+\d+\s+(\d+)
alert_cmd: tags=CMD,GREEN seconds=150 minimum=5 maximum=10 command=df -i .
So that a different set of alert tags can be used for the metric logging and metric threshold events, alternative tags can be specified using the threshold_tags parameter to alert_cmd, as follows:
capture: DISK2\s+\d+\s+(\d+)
alert_cmd: tags=CMD,METRIC_LEVEL_#hostname# seconds=150 minimum=5 threshold_tags=RED command=df -i .
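The above configuration will result in the alert containing the metric value being tagged CMD,METRIC_LEVEL_#hostname# (with #hostname# replaced by the host name) and the Below Lower Bound alert being tagged RED.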
If threshold_tags isn't specified, then both alerts use the tags specified by the tags parameter.
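The interaction between tags, threshold_tags and the two bounds can be summarised with a small Python sketch. It is an illustration under assumptions, not the Client's code: the function name cmd_alerts is hypothetical, and the text of the metric alert is simplified to the bare value.

```python
def cmd_alerts(value, tags, minimum=None, maximum=None, threshold_tags=None):
    """Return the (tags, text) alerts produced for one captured value."""
    # First alert: the metric itself (text simplified to the bare value).
    alerts = [(tags, str(value))]
    # Threshold alerts fall back to the ordinary tags when
    # threshold_tags is not given.
    bound_tags = threshold_tags if threshold_tags is not None else tags
    if maximum is not None and value > maximum:
        alerts.append((bound_tags, "Exceeded Upper Bound"))
    if minimum is not None and value < minimum:
        alerts.append((bound_tags, "Below Lower Bound"))
    return alerts
```

For a captured value of 3 with minimum=5 and threshold_tags=RED, the sketch returns the metric alert under the original tags plus a "Below Lower Bound" alert tagged RED, so only the threshold breach would appear in the Fault Management View.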
The final set of directives available for use with an Uhoh Client are:
In order to check whether a TCP server is ready to accept incoming connections (for example, a web server), use the alert_tcp directive.
The following Uhoh Client configuration line runs a check every 60 seconds to determine whether port 80 at 127.0.0.1 is accepting connections:
alert_tcp: tags=WEB_SRV,RED seconds=60 ip=127.0.0.1 port=80 timeout=2 message=Apache HTTPD is not running
The timeout= parameter specifies that the Uhoh Client will only wait for two seconds before giving up. The message= parameter contains the message to log if the connection test fails and must be the last parameter specified on the alert_tcp line.
Correlation of multiple alerts to generate further alerts is achieved using the alert_multi directive. In this way, an alert will be triggered if an Uhoh Client detects a series of alerts containing specific tags in a watchlist within a set period of time. Here's an example:
file: test.log
match: type_1
alert_all: tags=TAG1,INFO
file: test.log
match: type_2
alert_all: tags=TAG2,INFO
alert_multi: tags=RED,DERIVED seconds=60 collect=TAG1,TAG2 message=Both types found
Here we've set up two file consumers (alert_all) which detect log lines containing "type_1" and "type_2". Each consumer triggers an alert containing either TAG1 or TAG2. The alert_multi line uses its collect attribute to specify that if other alerts containing the tags TAG1 and TAG2 have been raised in the last 60 seconds, then an alert tagged with RED,DERIVED and the message text "Both types found" will be raised.
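One way to picture the correlation is as a record of when each watched tag was last seen, checked every time a new alert arrives. The Python sketch below follows that idea. It is an assumed model rather than Uhoh's implementation, the class name MultiCorrelator is hypothetical, and it does not reset its state after firing, which a real implementation may well do.

```python
class MultiCorrelator:
    """Report when every tag in a watchlist has been seen recently."""

    def __init__(self, collect, seconds):
        self.collect = set(collect)
        self.seconds = seconds
        self.last_seen = {}  # tag -> time of the latest alert carrying it

    def observe(self, tags, now):
        """Record an alert's tags at time now (in seconds).

        Returns True when all watched tags were seen within the window.
        """
        for tag in tags:
            if tag in self.collect:
                self.last_seen[tag] = now
        return all(
            tag in self.last_seen and now - self.last_seen[tag] <= self.seconds
            for tag in self.collect
        )
```

With collect=TAG1,TAG2 and seconds=60, an alert tagged TAG1 at time 0 followed by one tagged TAG2 at time 30 makes observe return True, which is the point at which the derived alert would be raised.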
As the correlation using alert_multi takes place within a single Uhoh Client instance, a slightly modified configuration pattern is required if correlation of alerts needs to be performed across multiple Uhoh Clients. The pattern runs an Uhoh Client on the same host as the Uhoh Server and this Client is used to correlate the alerts using the Uhoh Server's Log Stream as a source. Here's how this would be achieved:
The Uhoh Client configuration file for Host_A would specify:
file: test.log
match: type_1
alert_all: tags=MULTITAG1
The Uhoh Client configuration file for Host_B would specify:
file: test.log
match: type_1
alert_all: tags=MULTITAG2
The Uhoh Client configuration file for the host running the Uhoh Server would contain the following in order to capture MULTITAG1 and MULTITAG2 from the Log Stream, re-tag them as MCRR1 and MCRR2 and then correlate using tags MCRR1 and MCRR2:
file: server.log
match: MULTITAG1
alert_all: tags=MCRR1
file: server.log
match: MULTITAG2
alert_all: tags=MCRR2
alert_multi: tags=RED,DERIVED seconds=60 collect=MCRR1,MCRR2 message=Both types found
Choose the tags used for correlation carefully so as to avoid "loops" where a derived alert then re-triggers the same correlation.
It's also possible to configure a Client to receive alert notifications as incoming REST web hooks. To set this up, use the alert_rest directive:
alert_rest: port=<tcp_port_number>
For example:
alert_rest: port=5656
REST requests use a URL formed as follows:
/alert/<TAGS>/<MESSAGE>
For example, using the curl command:
curl 'http://192.168.1.20:5656/alert/RED,MSGQ/Inbound_queue_server:%20Queue%20is%20full'
Note that '%20' is used to represent spaces in the message text and that the alert hostname will be the name of the host on which the recipient Uhoh Client is running.
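When alerts are sent from a program rather than typed by hand, the message needs to be percent-encoded. The Python sketch below builds a URL of the form described above using the standard urllib.parse.quote function. The helper name alert_url is hypothetical, and leaving ':' and ',' unescaped simply follows the curl example; it is an assumption that the Client accepts them either way.

```python
from urllib.parse import quote


def alert_url(host, port, tags, message):
    """Build the /alert/<TAGS>/<MESSAGE> URL for an alert_rest listener."""
    # Keep the commas that separate the tags.
    tag_part = quote(",".join(tags), safe=",")
    # Spaces become %20; ':' is left as-is, as in the curl example.
    message_part = quote(message, safe=":")
    return "http://{}:{}/alert/{}/{}".format(host, port, tag_part, message_part)
```

The resulting string can be passed to any HTTP client that issues a plain GET request, which is what curl does by default.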
You may prefer to incorporate instrumentation into an application in order to feed alerts and metrics into Uhoh rather than have Uhoh read an application's log file. The simplest way to do this is for the application to send a UDP message to the Uhoh Client containing a string in the following format:
INJECT%%<TAGS>%%<MESSAGE>
For example, a shell-script could send an alert to Uhoh using the netcat (nc) command-line utility:
echo "INJECT%%RED,TEST_QUEUE%%This is a test alert" | nc -4u -w0 127.0.0.1 8888
Using UDP means that the application's instrumentation isn't tightly coupled to the Uhoh Client; tight coupling could affect the performance of the application.
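The same message can be sent from inside an application without shelling out to nc. Below is a minimal Python sketch using the standard socket module. The function names are hypothetical, the default port of 8888 is taken from the nc example, and the trailing newline is added only to mirror what echo sends; whether the Client requires it is an assumption.

```python
import socket


def inject_message(tags, message):
    """Format an INJECT%%<TAGS>%%<MESSAGE> string."""
    return "INJECT%%" + ",".join(tags) + "%%" + message


def send_alert(tags, message, host="127.0.0.1", port=8888):
    """Send one alert to an Uhoh Client as a single UDP datagram."""
    # The newline mirrors the output of `echo` in the shell example.
    payload = (inject_message(tags, message) + "\n").encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Calling send_alert(["RED", "TEST_QUEUE"], "This is a test alert") sends the same text as the nc command above. Because UDP is connectionless, the call returns without waiting for the Client, and no error is reported if nothing is listening.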
After an Uhoh Client configuration file has been edited on the Uhoh Server, the Uhoh Client which uses this configuration file will need to re-load the configuration in order to pick up the changes. There are two ways that this can be done:
For the second option, above, create a file in the Uhoh Server's installation folder called .reset which contains the host name for the Uhoh Client which needs re-configuring. The Uhoh Server, on finding the .reset file, will send a reset message to any Uhoh Client host listed in the file. (The .reset file can contain multiple host names.) The Uhoh Client will then reset itself to remove all monitoring configuration and return itself to its start-up state. It will then request configuration from the Uhoh Server and start monitoring as normal. Note that the Uhoh Client doesn't actually re-start itself during this procedure.
If an Uhoh Client is closed down, for example if the host it is running on is to be decommissioned, then the Uhoh Server will generate alerts indicating that the Uhoh Client in question is no longer running until the time specified via the client_remove_time Uhoh Server configuration parameter. However, if you need to remove the hostname of an Uhoh Client from an Uhoh Server's watchlist before client_remove_time occurs, create a file called .forget in the Uhoh Server's installation folder containing the hostname of the Uhoh Client to be removed from the watchlist.
Note that this action will need to be performed on all Uhoh Servers within a resilient set.