Ryan Babchishin <rbabchishin@win2ix.ca>, Win2ix Systems Inc. http://www.win2ix.ca
This document describes information collected during research and development of a clustered DRBD NFS solution.
This project had two purposes:
HA NFS solution for Media-X Inc.
Develop a standard tool kit and documentation that Win2ix can use for future projects
The standard operating system for Win2ix is Ubuntu 12.04, therefore all testing was done with this as the preferred target.
Because of the upcoming project with Media-X, computer hardware was chosen based on low cost and low power consumption.
A pair of identical systems was used for the cluster:
SuperMicro SuperServer 5015A-EHF-D525 (default settings)
Intel(R) Atom(TM) CPU D525 1.80GHz CPU (dual-core, hyper threaded)
4GB DDR3 RAM
2x750GB 2.5″ Scorpio Black hard drives
3ware 9650 RAID-1 controller card with 128MB of RAM, without a battery backup
2x on-board Gigabit Ethernet
The disks were partitioned according to Win2ix standards
sda1 – / – ext4 – 20GB
sda2 – /tmp – ext4 – 6GB
sda3 – /var – ext4 – 6GB
sda5 – swap – swap – 2GB
sda6 – drbd – drbd – 716GB
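To sanity-check the layout after installation, something like the following can be used (a quick sketch; it assumes the RAID controller presents the array as a single /dev/sda device):
# Show partitions, sizes and filesystems
fdisk -l /dev/sda
blkid
# Confirm swap is active
swapon -s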
eth0 was configured with an (otherwise unused) 192.168.0.[1|2] address for communication over a direct link between the systems, with no switch
Bonding was tested; see below for more information.
eth0 MTU was set to 9000
eth1 was configured with a regular network IP address for SSH/NFS/etc… access on both systems
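For reference, a minimal /etc/network/interfaces sketch for this layout on nfs01 (the eth1 address and gateway are placeholders; nfs02 would use 192.168.0.2 on eth0):
# Direct crossover link to the other node, used by DRBD and heartbeat
auto eth0
iface eth0 inet static
    address 192.168.0.1
    netmask 255.255.255.0
    mtu 9000

# LAN-facing interface for SSH/NFS access (address and gateway are examples)
auto eth1
iface eth1 inet static
    address 192.168.3.31
    netmask 255.255.255.0
    gateway 192.168.3.1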
Although not used due to lack of an extra PCI-E slot, Ethernet bonding was originally tested
Important notes when using bonding:
To get true single-connection load balancing in both directions, use bonding mode 0 (round robin) with NO SWITCH, or find a switch that supports it. Direct connections between systems work well. Switches generally support trunking, LACP or some other Cisco variant, but they will most likely only balance traffic for different IP connections over separate links. This won't help with DRBD, which uses a single connection.
Make sure you are seeing full throughput on your bonding device in both directions by testing with something like iperf
Sample working /etc/network/interfaces configuration segment:
iface bond0 inet static
    address 192.168.0.1
    netmask 255.255.255.0
    bond-mode 0
    bond-miimon 100
    bond-slaves eth0 eth1
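To verify that the bond really moves a single stream at full speed in both directions (as noted above), iperf can be run across the link; 192.168.0.1 is the bond0 address from the sample configuration:
# On the first node, start an iperf server
iperf -s
# On the second node, run a bidirectional test for 30 seconds
iperf -c 192.168.0.1 -d -t 30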
These sysctl changes seemed to make a small improvement, so I left them intact. This would need to be added to /etc/sysctl.conf.
# drbd tuning
net.ipv4.tcp_no_metrics_save = 1
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 87380 33554432
vm.dirty_ratio = 10
vm.dirty_background_ratio = 4
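Once the lines are in /etc/sysctl.conf, they can be applied immediately and spot-checked:
sysctl -p /etc/sysctl.conf
sysctl net.ipv4.tcp_rmem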
DRBD is the tricky one. It doesn't always perform well, despite what its developers would like you to think.
Our test RAID controller had no battery, but did have 128MB of RAM.
DRBD flushes all writes to disk, or uses barriers, unless you disable this behaviour. This is important for consistency, but disks were not made to be used this way and it will not perform well in most circumstances.
If you disable all flushing and barriers with no-md-flushes, no-disk-flushes and no-disk-barrier, DRBD performance will be nearly native disk speed. However, this means the DRBD device is more prone to corruption if it crashes, power goes out, etc…
If you disable just meta-data flushing with no-md-flushes, performance is reasonable (about 80% of native) and you still keep some protection. Leaving meta-data flushing enabled makes performance so bad that there is no point in using DRBD.
If you have a battery-backed RAID controller, disable all flushing and barriers. You will get near-native performance and will not have to worry about data corruption.
Variable rate synchronization is excellent because it won’t hurt DRBD performance and will use all bandwidth when it is free
use-rle is a performance tweak that is enabled by default in the latest version, so I turned it on
Basic tweaks like buffers, al-extents, etc… should all be enabled. They are well documented.
Protocol C seems to perform almost the same as A and B (based on benchmarks). Protocol C provides the most protection.
Our test systems had hard drives that were capable of write speeds very similar to 1Gb Ethernet transfer speeds. If faster disks/arrays were used, a faster link or bonding would be required for DRBD to keep up with writes.
Working, well-performing DRBD resource configuration:
resource r0 {
    net {
        #on-congestion pull-ahead;
        #congestion-fill 1G;
        #congestion-extents 3000;
        #sndbuf-size 1024k;
        sndbuf-size 0;
        max-buffers 8000;
        max-epoch-size 8000;
    }
    disk {
        #no-disk-barrier;
        #no-disk-flushes;
        no-md-flushes;
    }
    syncer {
        c-plan-ahead 20;
        c-fill-target 50k;
        c-min-rate 10M;
        al-extents 3833;
        rate 35M;
        use-rle;
    }
    startup {
        become-primary-on nfs01;
    }
    protocol C;
    device minor 1;
    meta-disk internal;
    on nfs01 {
        address 192.168.0.1:7801;
        disk /dev/sda6;
    }
    on nfs02 {
        address 192.168.0.2:7801;
        disk /dev/sda6;
    }
}
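For completeness, the initial bring-up of this resource would look roughly like the following (DRBD 8.3 syntax, as shipped with Ubuntu 12.04, is assumed):
# On both nodes: create meta-data and attach/connect the resource
drbdadm create-md r0
drbdadm up r0
# On nfs01 only: make it primary and start the initial sync
drbdadm -- --overwrite-data-of-peer primary r0
# Watch sync progress
watch cat /proc/drbd
# On nfs01, once primary: create the filesystem used in the fstab entry below
mkfs.ext4 /dev/drbd1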
Relevant section of ‘/etc/fstab’ used with this configuration:
# DRBD, mounted by heartbeat
/dev/drbd1  /mnt  ext4  noatime,noauto,nobarrier  0  0
‘nobarrier’ makes a big difference in performance (on my test systems) and still maintains filesystem integrity
‘noatime’ makes a small performance difference by disabling access time updates on every file read
‘noauto’ stops the init scripts (mount -a) from mounting it – heartbeat will manage this
Before bothering with NFS or anything else, it is a good idea to make sure DRBD is performing well.
Benchmark tools
atop – watches CPU load, IO load, IO throughput, network throughput, etc… for the whole system in one screen, run on both systems during your benchmarking to see what’s going on
iptraf – detailed network information, throughput, etc… if you need to dig further
bonnie++ – performs many IO benchmarks and gives a good idea of actual disk performance. Must use a data set at least 2x larger than physical RAM to be accurate (see the example after this list)
postmark – great for overloading a system, performing many small writes/reads/appends, directory creation, etc… good for testing NFS performance when everything is up. Gives you ops/sec performance results per test type
dd – basic, initial benchmarking – see dd section
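As an example of the 2x-RAM rule for bonnie++ mentioned above: with 4GB of RAM in these systems, a run against the DRBD-backed mount might look like this (directory and user are illustrative):
# 8192MB data set = 2x physical RAM, run as an unprivileged user
bonnie++ -d /mnt -s 8192 -u nobody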
There are some simple tests you can do to test performance of a storage device using DD. However, other tools should be used later for more accurate results (real world). When I’m benchmarking or trying to identify bottlenecks, I run atop on the same system in a separate terminal while dd is transferring data.
Use direct access to write sequentially to the filesystem (doesn't work on all filesystems)
dd if=/dev/zero of=testfile bs=100M count=20 oflag=direct
Use regular access to write sequentially to the filesystem, and flush before exit
dd if=/dev/zero of=testfile bs=100M count=20 conv=fsync
Use direct access to read sequentially from the filesystem (doesn't work on all filesystems)
dd if=testfile of=/dev/null bs=100M iflag=direct
Use regular access to read sequentially from the filesystem. Drop system cache before doing this or you might just read the file from cache
dd if=testfile of=/dev/null bs=100M
Drop system cache
sync
echo 3 > /proc/sys/vm/drop_caches
Write directly to block device, bypassing filesystem. This will destroy data on the block device
dd if=/dev/zero of=/dev/sdXX bs=100M count=20 oflag=direct
Read directly from block device, bypassing filesystem
dd if=/dev/sdXX of=/dev/null bs=100M count=20 iflag=direct
The only configuration needed was in ‘/etc/exports’:
/mnt 192.168.3.0/24(rw,async,no_subtree_check,fsid=0)
‘async’ – I found this particular setup performed much better (about 50%) with async rather than sync
‘fsid=0’ is a good thing to use in HA solutions. If all nodes use the same ID# (which is trivial) for the same mount, stale handles will be avoided after a failover.
‘/etc/idmapd.conf’ may need to be adjusted to match your domain when using NFSv4 (on client and/or server)
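After editing /etc/exports, the export table can be reloaded and verified without restarting the NFS server:
# Re-export everything in /etc/exports, then list active exports with their options
exportfs -ra
exportfs -v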
In testing, I chose to use this command to mount NFS:
mount nfs:/mnt /testnfs -o rsize=32768,wsize=32768,hard,timeo=50,bg,actimeo=3,noatime,nodiratime
Explanation:
rsize/wsize – set the read and write maximum block size to 32k, appropriate for the average file size of the customer's data
hard – causes processes accessing the NFS share to block forever when it becomes unavailable, unless killed with SIGKILL
noatime – do not update file access times every time a file is read
nodiratime – like noatime but for directories
bg – continue NFS mount in the background, rather than blocking – mostly to prevent boot problems
tcp – not specified because it's the default – if udp were used, data could be lost during a failover; tcp will keep retrying
timeo – retry NFS requests after 5 seconds (the value is specified in tenths of a second)
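For a persistent client mount using the same options, an /etc/fstab entry along these lines could be used (192.168.3.30 is the floating cluster IP from the heartbeat configuration below; the mount point is an example):
192.168.3.30:/mnt  /testnfs  nfs  rsize=32768,wsize=32768,hard,timeo=50,bg,actimeo=3,noatime,nodiratime  0  0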
Heartbeat without Pacemaker was chosen. Pacemaker seemed too complex and difficult to manage for what was needed.
In this test setup, heartbeat has two Ethernet connections to communicate between nodes: the first is the network/subnet/LAN connection and the other is the DRBD direct crossover link. Having multiple communication paths is important so that one heartbeat node doesn't lose contact with the other. Once that happens, neither one knows which is master and the cluster becomes ‘split-brained’.
STONITH
S.T.O.N.I.T.H = Shoot The Other Node In The Head
STONITH is the facility that Heartbeat uses to reboot a cluster node that is not responding. This is very important because heartbeat needs to know that the other node is not using DRBD (or other corruptible resources). If a node is really not responding at all, the other node will reboot it using STONITH, which uses IPMI (in the examples below), and then take over the resources.
When two nodes both believe they are master (own the resources), it is called split-brain. This can lead to problems and sometimes data corruption. STONITH with IPMI can protect against this.
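The external/ipmi STONITH plugin drives the other node's BMC over the network (via ipmitool), so it is worth confirming IPMI access by hand before trusting it; a sketch using the addresses and credentials from the ha.cf below:
# From nfs02, query nfs01's BMC (192.168.3.34); a power cycle is what STONITH would effectively trigger
ipmitool -I lan -H 192.168.3.34 -U ADMIN -P somepwd chassis power status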
Working Example
Create the configuration files shown below
Install heartbeat (from repositories if possible)
Start logd
Start heartbeat
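On Ubuntu 12.04 those steps translate to roughly the following; the init script names are assumptions and may differ depending on packaging:
apt-get install heartbeat
/etc/init.d/logd start
/etc/init.d/heartbeat start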
The ha.cf file defines the cluster and how its nodes interact
/etc/ha.d/ha.cf:
# Give cluster 30 seconds to start
initdead 30
# Keep alive packets every 1 second
keepalive 1
# Misc settings
traditional_compression off
deadtime 10
deadping 10
warntime 5
# Nodes in cluster
node nfs01 nfs02
# Use ipmi to check power status and reboot nodes
stonith_host nfs01 external/ipmi nfs02 192.168.3.33 ADMIN somepwd lan
stonith_host nfs02 external/ipmi nfs01 192.168.3.34 ADMIN somepwd lan
# Use logd, configure /etc/logd.cf
use_logd on
# Don't move service back to preferred host when it comes up
auto_failback off
# If all systems are down, it's failure
ping_group lan_ping 192.168.3.1 192.168.3.13
# Takeover if pings (above) fail
respawn hacluster /usr/lib/heartbeat/ipfail
##### Use unicast instead of default multicast so firewall rules are easier
# nfs01
ucast eth0 192.168.3.32
ucast eth1 192.168.0.1
# nfs02
ucast eth0 192.168.3.31
ucast eth1 192.168.0.2
The haresources file describes resources provided by the cluster. Its format is: [Preferred node] [1st Service] [2nd Service]… Services are started in the order they are listed and stopped in the reverse order. They will start on the preferred node when possible.
/etc/ha.d/haresources:
nfs01 drbddisk::r0 Filesystem::/dev/drbd1::/mnt::ext4 IPaddr2::192.168.3.30/24/eth0 nfs-kernel-server
The logd.conf file defines logging for heartbeat
/etc/logd.conf:
debugfile /var/log/ha-debug
logfile /var/log/ha-log
syslogprefix linux-ha
There are numerous tests you can perform. Try pinging the floating IP address while pulling cables, initiating a heartbeat takeover, killing heartbeat with SIGKILL, etc… But my favourite test is of the NFS service, the part that matters the most. /var/log/ha-debug will have lots of detail about what heartbeat is doing during your tests.
From another system, mount the NFS share from the cluster
Use rsync --progress -av to start copying a large file (1-2 GB) to the share (see the sketch after these steps)
When the progress is 20%-30%, pull the network cable from the active node
Rsync will lock up (as intended) due to NFS blocking
After 5-10 seconds, the file should continue transferring until finished with no errors
Do an md5 checksum comparison of the original file and the file on the NFS share
Both files should be identical; if not, there was corruption of some kind
Try the test again by reading from NFS, rather than writing to it
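A concrete version of this test might look like the following, run from the client (‘bigfile.img’ and the mount point are examples, and ‘nfs’ refers to the cluster's floating address used in the mount command above):
# Copy a large file to the share; pull the active node's cable at 20-30% progress
rsync --progress -av bigfile.img /testnfs/
# After the transfer completes, compare checksums of the original and the copy
md5sum bigfile.img /testnfs/bigfile.img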