Monit & MMonit

I've been using a very simple monitoring package called Monit. It's not an IPMI- or SNMP-aware monitoring package, but its simple setup and built-in service monitoring were appealing for the situations I needed to track. A Monit agent runs on each target node. You may then install MMonit on a central monitoring management server, which tracks all targets and reports on them through a nice "green-light/red-light" web interface. What got me interested in Monit, besides the fact that it's FOSS, was the easy configuration.

A Monit control file example:

    #
    # Monit control file
    #

    set daemon 120 # Poll at 2-minute intervals
    set logfile syslog facility log_daemon
    set alert foo@bar.baz
    set httpd port 2812 and use address localhost
        allow localhost   # Allow localhost to connect
        allow admin:Monit # Allow Basic Auth

    check system myhost.mydomain.tld
        if loadavg (1min) > 4 then alert
        if loadavg (5min) > 2 then alert
        if memory usage > 75% then alert
        if swap usage > 25% then alert
        if cpu usage (user) > 70% then alert
        if cpu usage (system) > 30% then alert
        if cpu usage (wait) > 20% then alert

    check process apache
        with pidfile "/usr/local/apache/logs/httpd.pid"
        start program = "/etc/init.d/httpd start" with timeout 60 seconds
        stop program = "/etc/init.d/httpd stop"
        if 2 restarts within 3 cycles then timeout
        if totalmem > 100 Mb then alert
        if children > 255 for 5 cycles then stop
        if cpu usage > 95% for 3 cycles then restart
        if failed port 80 protocol http then restart
        group server
        depends on httpd.conf, httpd.bin

    check file httpd.conf
        with path /usr/local/apache/conf/httpd.conf
        # Reload apache if the httpd.conf file was changed
        if changed checksum
            then exec "/usr/local/apache/bin/apachectl graceful"

    check file httpd.bin
        with path /usr/local/apache/bin/httpd
        # Run /watch/dog in the case that the binary was changed
        if failed checksum then exec "/watch/dog"

    include /etc/monit/mysql.monitrc
    include /etc/monit/mail/*.monitrc

One of the features I liked was the ability to use "conditional logic" in determining the alert action. For example:

    if 2 restarts within 3 cycles then timeout

This tells Monit to time out the service, that is, to stop monitoring and restarting it, if it had to be restarted 2 times within 3 polling cycles. Another example:

    if loadavg (1min) > 4 for 5 cycles then alert

This generates an alert only if the 1-minute load average stays above 4 for 5 consecutive polling cycles, which is 10 minutes at the 120-second poll interval used above.
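Monit's condition syntax also accepts a "times within" form, which tolerates occasional failures instead of requiring consecutive ones. The fragment below is a minimal sketch that reuses the apache paths from the control file above. The 3-of-5 threshold is my own illustrative choice, not a recommendation.

    check process apache
        with pidfile "/usr/local/apache/logs/httpd.pid"
        start program = "/etc/init.d/httpd start"
        stop program = "/etc/init.d/httpd stop"
        # Restart only if the HTTP check fails in 3 of the last 5 cycles
        # (illustrative threshold, tune it for your own service)
        if failed port 80 protocol http for 3 times within 5 cycles then restart
        # Give up on the service after 2 restarts within 3 cycles
        if 2 restarts within 3 cycles then timeout

Running "monit -t" checks the syntax of the control file, which is worth doing before reloading the daemon with a changed rule.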

As you can see, the configuration file is human-readable and easy to interpret. You may also have noticed that the monitoring program will take action on its own, either restarting the service or generating an alert.

Monit provides a web interface that can be used not only to query monitoring status but also to control the monitoring of configured services. The MMonit package extends the basic (free) Monit program by adding a central monitoring service with historical tracking of events. MMonit is proprietary software, with support licenses of Basic (EUR 129, 10 clients), Professional (EUR 229, unlimited clients), and Premium (EUR 998, unlimited clients, source access). The license is a one-time payment (a non-recurring cost) and does not expire.
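For the agents to show up in MMonit, each Monit instance has to be told where the central server is. What follows is a minimal sketch, not a tested setup. It assumes MMonit listens on port 8080 at the hypothetical address 192.0.2.10 and still accepts the monit:monit collector credentials I believe it ships with; substitute your own host and credentials.

    # Report status and events to the central MMonit server
    # (address and credentials are assumptions, replace them)
    set mmonit http://monit:monit@192.0.2.10:8080/collector

    # MMonit connects back to this port to start, stop, or restart
    # services, so the agent must not listen on localhost only
    set httpd port 2812
        allow localhost   # Allow localhost to connect
        allow 192.0.2.10  # Allow the MMonit server (hypothetical address)
        allow admin:Monit # Allow Basic Auth

Note that the earlier example binds the web interface to localhost, which is fine for a standalone agent but would keep a remote MMonit server from controlling it.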

Whether or not you decide to implement corrective action on events will depend upon your system architecture. If you are running a Heartbeat/Pacemaker cluster with built-in monitoring, you will not want your monitoring agents to restart the services. The Heartbeat/Pacemaker packages have their own monitor/restart services, and you don't want two different services fighting each other to restart applications.
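On a cluster node, one way to keep Monit out of the restart business is an alert-only check. This sketch again reuses the apache paths from the earlier example and assumes your Monit version supports the "mode passive" statement. It deliberately defines no start or stop program, so Monit has nothing to restart the service with.

    check process apache
        with pidfile "/usr/local/apache/logs/httpd.pid"
        # Passive mode: watch and alert, but never try to fix the service
        mode passive
        if failed port 80 protocol http then alert
        if cpu usage > 95% for 3 cycles then alert
        if totalmem > 100 Mb then alert

With a check like this, the cluster manager remains the only component that starts, stops, or moves the service, while Monit still reports what it sees.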