MongoDB - Network Failure Simulation & Monitoring

Post date: Mar 17, 2013 8:05:15 AM

Network failures can also be simulated in a local MongoDB cluster. In a five-node cluster whose mongod instances run on different ports, a communication failure with two of the nodes can be simulated with iptables rules that drop outgoing packets from specific source ports. The first two commands below insert the rules; the last two delete them again once the test is done:

sudo iptables -I OUTPUT 1 -p tcp --sport 27111 -j DROP

sudo iptables -I OUTPUT 2 -p tcp --sport 27112 -j DROP

sudo iptables -D OUTPUT 1

sudo iptables -D OUTPUT 1

With the rules in place, the rs.status() command reports the same exception messages as it would if the mongod nodes had been stopped:

socket exception [CONNECT_ERROR] for localhost:27111

socket exception [CONNECT_ERROR] for localhost:27112
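The effect of the rules can also be checked from outside the mongo shell. The following is a minimal sketch, not part of any MongoDB tooling: the helper names port_reachable and probe_members are made up for illustration, and the host is assumed to be localhost. With the DROP rules active, a connection attempt would be expected to time out rather than be refused, so a short timeout is used.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections as well as timeouts.
        return False


def probe_members(members, timeout=2.0):
    """Map each (host, port) pair to True (reachable) or False (unreachable)."""
    return {member: port_reachable(member[0], member[1], timeout) for member in members}


if __name__ == "__main__":
    # Ports 27111 and 27112 are the ones blocked above; the host is an assumption.
    results = probe_members([("localhost", 27111), ("localhost", 27112)])
    for (host, port), ok in results.items():
        print(f"{host}:{port} {'reachable' if ok else 'unreachable'}")
```

Running the script before inserting the rules and again afterwards shows whether the two members actually became unreachable.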

If MongoDB monitoring and alerting is set up in Nagios, a critical notification is raised in this case. It is then interesting to evaluate whether the remaining nodes can handle the traffic with acceptable performance. To check this, the check_queries_per_second function of Mike Zupan's MongoDB Nagios Plugin, located at https://github.com/mzupan/nagios-plugin-mongodb, needs to be revised. An important change for getting consistent values is to fix a concurrency problem by changing the line

db.nagios_check.update(last_count, {'$set': {"data.%s" % query_type : {'count': num, 'ts': int(time.time())}}})

to

db.nagios_check.update({'_id': last_count['_id']}, {'$set': {"data.%s" % query_type : {'count': num, 'ts': int(time.time())}}})

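The reason is that last_count holds the document as it was read earlier, and the stored document may have been changed between calls. In that situation, using the whole document as the update filter matches nothing, so the update does not occur at all. Filtering on _id alone still finds the document.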
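The difference between the two filters can be illustrated without a database. The sketch below is a simplified in-memory model, not pymongo: matches and update_one are hypothetical helpers that only do plain equality matching, and the document layout is made up for the example. Real MongoDB matching of embedded documents is stricter still, since it is also sensitive to field order.

```python
import copy


def matches(document, query):
    """Equality-only match: every field in the query must equal the stored value."""
    return all(key in document and document[key] == value for key, value in query.items())


def update_one(collection, query, new_fields):
    """Apply new_fields to the first matching document; return how many were modified."""
    for document in collection:
        if matches(document, query):
            document.update(new_fields)
            return 1
    return 0


# Stored check document (the field layout is illustrative only).
collection = [{"_id": 1, "data": {"query": {"count": 100, "ts": 1000}}}]

# This check run reads a snapshot of the document ...
last_count = copy.deepcopy(collection[0])

# ... and another check run modifies the stored document in the meantime.
collection[0]["data"]["insert"] = {"count": 7, "ts": 1001}

# Filtering on the whole stale snapshot matches nothing.
stale = update_one(collection, last_count, {"checked": True})

# Filtering on _id alone still finds the document.
by_id = update_one(collection, {"_id": last_count["_id"]}, {"checked": True})

print(stale, by_id)  # prints: 0 1
```

The stale snapshot no longer equals the stored document, so only the _id filter results in a modification.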

With this fix, accurate data for the replica set is retrieved and Nagios correctly displays the number of queries, updates, inserts and deletes per second. From this point we can analyse whether the remaining nodes can cope with the current load and make further adjustments if needed.
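A per-second figure of this kind is typically derived from two samples of a cumulative counter, which is why a lost update of the stored count and timestamp distorts the result. The following is a minimal sketch of that arithmetic under this assumption; per_second is a hypothetical helper and is not taken from the plugin.

```python
def per_second(previous_count, previous_ts, current_count, current_ts):
    """Return operations per second between two counter samples, or None if undefined."""
    elapsed = current_ts - previous_ts
    delta = current_count - previous_count
    if elapsed <= 0 or delta < 0:
        # No time has passed, or the counter went backwards (for example after a restart).
        return None
    return delta / elapsed


if __name__ == "__main__":
    # 300 additional operations over 60 seconds.
    print(per_second(100, 1000, 400, 1060))  # prints: 5.0
```

If the stored sample is not refreshed, the next run compares against an older count and timestamp than it should, so the rate is averaged over a longer window than the check interval.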