
Using SNMP as a process watchdog

Simple Network Management Protocol (SNMP) is one of the most widely used protocols for remotely monitoring servers and miscellaneous devices. Using SNMP, a management station can collect information from, and configure, devices such as servers, routers, and printers. Usually, the managed agent reports information such as disk space usage, network traffic, and faults that have occurred (fan failure, hardware failure, cable disconnection). Another useful feature of some agents is the ability to monitor various aspects of the host and either send traps when specific conditions are met or raise an error flag for the monitored item.
More information about SNMP can be found here. The information collected by using SNMP can be put in nice graphs with tools like Cacti, or can be used for alarms with monitoring tools like Nagios.
SNMP can be used to make simple checks on a server running the net-snmp agent, such as testing whether a process is alive, and to execute the appropriate commands to fix a possible error, without using a complicated process watchdog. The good news is that the watchdog process can run on a different machine.

To use this kind of watchdog, a net-snmp agent must be configured to watch for specific processes and it must have a definition for a 'procfix' command. The procfix command is invoked by the snmpd process running on the agent.

Below is a sample snmpd.conf fragment, from a FreeBSD server running radiusd:

snmpd.conf snippet:
---
proc radiusd
procfix radiusd /usr/local/etc/rc.d/radiusd restart
---

When the process is running fine, an snmpwalk or snmpget shows prErrorFlag as noError(0):
$ snmpwalk -c community -v2c host UCD-SNMP-MIB::prTable
UCD-SNMP-MIB::prIndex.1 = INTEGER: 1
UCD-SNMP-MIB::prNames.1 = STRING: radiusd
UCD-SNMP-MIB::prMin.1 = INTEGER: 0
UCD-SNMP-MIB::prMax.1 = INTEGER: 0
UCD-SNMP-MIB::prCount.1 = INTEGER: 1
UCD-SNMP-MIB::prErrorFlag.1 = INTEGER: noError(0)
UCD-SNMP-MIB::prErrMessage.1 = STRING:
UCD-SNMP-MIB::prErrFix.1 = INTEGER: noError(0)
UCD-SNMP-MIB::prErrFixCmd.1 = STRING: /usr/local/etc/rc.d/radiusd restart

This flag can be checked remotely (if the agent's snmpd configuration allows it), and an snmpset can be sent to trigger the fix when the flag is set.

The script below checks the error flag and, if it is set to 1, runs snmpset to request execution of the defined procfix command. The SNMP agent configuration must allow write access for the management host.
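As a rough sketch of what that write access could look like on the agent, assuming community-based SNMPv2c is used: the line below grants read-write access to the placeholder community 'community', only from a management host at the placeholder address 192.0.2.10, and only for the UCD-SNMP-MIB prTable subtree. The address and the decision to restrict the OID subtree are assumptions, not part of the original setup.

```
# snmpd.conf (agent side); community name and source address are placeholders
# .1.3.6.1.4.1.2021.2 is UCD-SNMP-MIB::prTable
rwcommunity community 192.0.2.10 .1.3.6.1.4.1.2021.2
```

Limiting the source and the subtree keeps the ability to trigger procfix away from other hosts that may know the community string.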


snmp-process-watch.sh

#!/bin/sh
# This script uses SNMP to read prErrorFlag for a process on a specific host.
# If the error flag is set, an snmpset on prErrFix triggers the procfix command.

SNMP_COMMUNITY='community'
HOST='host'

SNMPGET='/usr/local/bin/snmpget'
SNMPSET='/usr/local/bin/snmpset'

# If multiple processes are monitored, the index can be added to a loop
PROC_IDX="1"
SOURCE_OID="UCD-SNMP-MIB::prErrorFlag.${PROC_IDX}"
# -OvQe prints only the value, with the enum shown as a number (0 or 1)
PROC_ERR_STATUS=`${SNMPGET} -v2c -c "${SNMP_COMMUNITY}" -OvQe "${HOST}" "${SOURCE_OID}"`

if [ "${PROC_ERR_STATUS}" = "1" ]; then
    echo "ERROR for proc #${PROC_IDX}, sending fix"
    ${SNMPSET} -v2c -c "${SNMP_COMMUNITY}" "${HOST}" "UCD-SNMP-MIB::prErrFix.${PROC_IDX}" = 1
# An empty 'else' branch is a syntax error in sh, so the optional OK message stays commented out:
# else
#     echo "OK Proc Error flag for proc #${PROC_IDX} is 0"
fi
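
The comment about looping over several indexes could be sketched roughly as follows. This is only an illustration: the function names check_and_fix and check_all, and the option to override the tool paths through the environment, are assumptions that do not appear in the original script.

```shell
#!/bin/sh
# Sketch: check several prTable rows in one run.
# check_and_fix and check_all are hypothetical helper names.

SNMP_COMMUNITY="${SNMP_COMMUNITY:-community}"
HOST="${HOST:-host}"
SNMPGET="${SNMPGET:-/usr/local/bin/snmpget}"
SNMPSET="${SNMPSET:-/usr/local/bin/snmpset}"

# check_and_fix IDX: read prErrorFlag.IDX; if it is 1, set prErrFix.IDX to 1
check_and_fix() {
    idx="$1"
    status=`$SNMPGET -v2c -c "${SNMP_COMMUNITY}" -OvQe "${HOST}" "UCD-SNMP-MIB::prErrorFlag.${idx}"`
    if [ "${status}" = "1" ]; then
        echo "ERROR for proc #${idx}, sending fix"
        $SNMPSET -v2c -c "${SNMP_COMMUNITY}" "${HOST}" "UCD-SNMP-MIB::prErrFix.${idx}" = 1
    fi
}

# check_all IDX...: run the check for every index given as an argument
check_all() {
    for i in "$@"; do
        check_and_fix "$i"
    done
}

# Example call, to be enabled once the indexes match the agent's proc lines:
# check_all 1 2 3
```

Each index corresponds to one 'proc' line in the agent's snmpd.conf, in the order reported by prIndex, so the list passed to check_all has to be kept in step with that file.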


If the script above is executed periodically, a process that has died can be restarted automatically. This can be done using cron. I used this method on a FreeBSD server that runs RADIUS with a MySQL backend, as a workaround for MySQL starting too slowly, which made it impossible for radiusd to start properly.
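
For illustration, a crontab entry on the management host might look like the line below. The five-minute interval and the install path /usr/local/sbin/snmp-process-watch.sh are assumptions; adjust both to the local setup.

```
# m   h  dom mon dow  command
*/5   *  *   *   *    /usr/local/sbin/snmp-process-watch.sh
```

The interval should be longer than the time the procfix command needs to complete, otherwise a slow restart could be triggered again before the process has come back up.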
