Monit & MMonit
I've been using a very simple monitoring package called Monit. It is not an IPMI- or SNMP-aware monitoring package, but its simple setup and built-in service monitoring are appealing for the situations I needed to track. A monit agent runs on each target node. You may then install mmonit on a central monitoring management server, which tracks all targets and reports their status in a nice "green-light/red-light" web interface. What got me interested in monit, besides the fact that it's FOSS, was the easy configuration.
A Monit control file example:
#
# Monit control file
#
set daemon 120                      # Poll at 2-minute intervals
set logfile syslog facility log_daemon
set alert foo@bar.baz
set httpd port 2812 and use address localhost
    allow localhost                 # Allow localhost to connect
    allow admin:Monit               # Allow Basic Auth

check system myhost.mydomain.tld
    if loadavg (1min) > 4 then alert
    if loadavg (5min) > 2 then alert
    if memory usage > 75% then alert
    if swap usage > 25% then alert
    if cpu usage (user) > 70% then alert
    if cpu usage (system) > 30% then alert
    if cpu usage (wait) > 20% then alert

check process apache
    with pidfile "/usr/local/apache/logs/httpd.pid"
    start program = "/etc/init.d/httpd start" with timeout 60 seconds
    stop program = "/etc/init.d/httpd stop"
    if 2 restarts within 3 cycles then timeout
    if totalmem > 100 Mb then alert
    if children > 255 for 5 cycles then stop
    if cpu usage > 95% for 3 cycles then restart
    if failed port 80 protocol http then restart
    group server
    depends on httpd.conf, httpd.bin

check file httpd.conf
    with path /usr/local/apache/conf/httpd.conf
    # Reload apache if the httpd.conf file was changed
    if changed checksum
        then exec "/usr/local/apache/bin/apachectl graceful"

check file httpd.bin
    with path /usr/local/apache/bin/httpd
    # Run /watch/dog in the case that the binary was changed
    if failed checksum then exec "/watch/dog"

include /etc/monit/mysql.monitrc
include /etc/monit/mail/*.monitrc
One of the features I liked was the ability to use "conditional logic" in determining the alert action. For example,
if 2 restarts within 3 cycles then timeout
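This says to time out the service, meaning Monit stops checking and restarting it, if it had to be restarted 2 times within 3 polling cycles. With the 120-second poll interval set above, 3 cycles span 6 minutes. Another example:
if loadavg (1min) > 4 for 5 cycles then alert
This says that an alert is generated only if the load average stays above 4 for 5 consecutive polling cycles (10 minutes at a 120-second interval), so a brief spike does not trigger it.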
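To make the cycle-based conditions more concrete, here is a minimal sketch that combines three variants in one service check. It reuses the apache paths from the control file above. The thresholds are illustrative assumptions, not recommendations, and the "for 3 times within 5 cycles" form should be verified against the manual for your installed Monit version.

```
# Sketch only: thresholds are hypothetical, paths reused from the example above
check process apache with pidfile "/usr/local/apache/logs/httpd.pid"
    start program = "/etc/init.d/httpd start" with timeout 60 seconds
    stop program = "/etc/init.d/httpd stop"
    # Alert only if the port test fails in 3 of the last 5 polls
    if failed port 80 protocol http for 3 times within 5 cycles then alert
    # Restart only after 3 consecutive polls above the CPU threshold
    if cpu usage > 95% for 3 cycles then restart
    # Give up on the service if it keeps needing restarts
    if 2 restarts within 3 cycles then timeout
```

Requiring a condition to hold across several cycles trades faster reaction for fewer false alarms. How many cycles to wait depends on the poll interval set with "set daemon".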
As you can see, the configuration file is human-readable and easy to interpret. You may also have noticed that the monitoring program can act on events, either by taking corrective action such as restarting the service or by simply generating an alert.
Monit provides a web interface that can be used not just to query monitor status but also to control the monitoring of configured services. The MMonit package extends the basic (free) monit program by adding a central monitoring service with historical tracking of events. MMonit is proprietary software, with support licenses of Basic (EUR 129, 10 clients), Professional (EUR 229, unlimited clients), and Premium (EUR 998, unlimited clients, source access). The license is a one-time payment (non-recurring cost) and does not expire.
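For reference, pointing a monit agent at a central MMonit server is typically a one-line addition to the agent's control file. The sketch below is an assumption-laden example: the host name, port, and the monit:monit credentials are placeholders, so substitute the values from your own MMonit installation.

```
# Sketch only: mmonit.example.com, port 8080 and monit:monit are placeholders
set mmonit http://monit:monit@mmonit.example.com:8080/collector
```

If MMonit is also meant to control services on the agent, it needs to reach the agent's web interface. An httpd section bound only to localhost, as in the example control file, would have to be opened to the MMonit server's address.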
Whether or not you implement corrective action on events will depend on your system architecture. If you are running a heartbeat/pacemaker cluster with built-in monitoring, you will not want your monitoring agents to restart the services. The heartbeat/pacemaker packages have their own monitor/restart services, and you don't want two different services fighting each other to restart applications.
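One way to keep monit from competing with a cluster manager is to make a check alert-only. The sketch below assumes your Monit version supports the "mode passive" statement, in which monit watches the service and raises alerts but does not try to start, stop, or restart it. The service name and pidfile path are reused from the example control file.

```
# Sketch only: alert-only check for a service managed by heartbeat/pacemaker
check process apache with pidfile "/usr/local/apache/logs/httpd.pid"
    mode passive        # monitor and alert, but leave restarts to the cluster
    if failed port 80 protocol http then alert
    if cpu usage > 95% for 3 cycles then alert
```

Leaving out the start and stop program lines and using only "then alert" actions has a similar effect. The explicit mode statement makes the intent clearer to the next person reading the file.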