MPI Benchmarks

About the Benchmarks

The OSU Micro-Benchmarks 5.4.3 (07/23/18) from The Ohio State University were used to compare the performance of the MPI modules. Information about the benchmarks can be found here:

http://mvapich.cse.ohio-state.edu/benchmarks/

Comparison Between Modules

Four benchmarks were run to compare performance between modules: latency, bandwidth, multi-pair latency, and bidirectional bandwidth. These are all from the point-to-point subset of benchmarks, meaning communication was between two nodes. In the latency test the sender sends messages of various sizes and waits for a reply from the receiver; the one-way latency is measured. In the bandwidth test the sender sends back-to-back messages of a fixed size and waits for a reply from the receiver, which responds only after receiving all of the messages. The multi-pair latency test is similar to the latency test, but multiple pairs of processes send messages at the same time. In the bidirectional bandwidth test both nodes send and receive messages simultaneously, measuring the aggregate bandwidth between the two nodes. More information on these and all of the benchmarks can be found on OSU's site.
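
To illustrate the measurement pattern, the following is a minimal ping-pong latency sketch in C; it is not the OSU code itself, and the message size, warm-up count, and iteration count are arbitrary values chosen for illustration.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Minimal ping-pong latency sketch (not the OSU code itself): rank 0 sends
 * SIZE bytes to rank 1 and waits for a reply of the same size; the average
 * round-trip time is halved to report one-way latency. */
#define SIZE   8          /* message size in bytes (illustrative) */
#define WARMUP 100        /* untimed iterations to settle the link */
#define ITERS  1000       /* timed iterations */

int main(int argc, char **argv)
{
    int rank, i;
    double t_start = 0.0;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = calloc(SIZE, 1);

    for (i = 0; i < WARMUP + ITERS; i++) {
        if (i == WARMUP)                 /* start timing after the warm-up */
            t_start = MPI_Wtime();

        if (rank == 0) {
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0)
        /* one-way latency = half the average round-trip time */
        printf("%d bytes: %.2f us one-way\n", SIZE,
               (MPI_Wtime() - t_start) / ITERS / 2.0 * 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

A program like this would be launched with two ranks placed on separate nodes, matching the point-to-point setup described above.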

All benchmarks were run 10 times for each module. The graphs plot the average of these runs, and error bars denote one standard deviation. A subset of all MPI modules was tested; a small version-check sketch follows the list:

  • openmpi-2.0/gcc : openmpi version 2.0 compiled with gcc version 4.8.5 with psm2 libraries enabled

  • openmpi-2.0/intel : openmpi version 2.0 compiled with icc version 17.0.1 with psm2 libraries enabled

  • mvapich2-2.2/gcc : mvapich2 version 2.2 compiled with gcc version 4.8.5

  • mvapich2-2.2/intel : mvapich2 version 2.2 compiled with icc version 17.0.1

  • mvapich2-2.2-psm/gcc : mvapich2 version 2.2 compiled with gcc version 4.8.5 with psm2 libraries enabled

  • mvapich2-2.2-psm/intel : mvapich2 version 2.2 compiled with icc version 17.0.1 with psm2 libraries enabled

  • impi : intel mpi compiled with icc version 17.0.1 with psm2 libraries enabled

  • mpich/gcc : mpich version 3.1.3 compiled with gcc version 4.8.5

  • mpich/intel : mpich version 3.1.3 compiled with intel version 17.0.1
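
As a quick sanity check that a loaded module provides the expected MPI implementation and version, a short MPI-3 program such as the following sketch could be compiled and run under each module; the exact output format of the version string is implementation-specific.

#include <mpi.h>
#include <stdio.h>

/* Print the MPI standard version and the library's own version string.
 * MPI_Get_library_version is available in MPI-3 and later. */
int main(int argc, char **argv)
{
    int rank, major, minor, len;
    char lib[MPI_MAX_LIBRARY_VERSION_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_version(&major, &minor);
    MPI_Get_library_version(lib, &len);

    if (rank == 0)
        printf("MPI standard %d.%d\n%s\n", major, minor, lib);

    MPI_Finalize();
    return 0;
}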


The two mvapich modules without psm2 libraries enabled failed to run the benchmarks and are currently considered to be broken. During this benchmark and other tests the following error was produced:

===================================================================================

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES

= PID 93833 RUNNING AT compute035

= EXIT CODE: 255

= CLEANING UP REMAINING PROCESSES

= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

===================================================================================

The data collected using the two mpich modules varied so drastically from all other modules that it is not possible to get good resolution on all modules within one graph. For this reason, graphs are provided both with and without the data for the mpich modules.

Graphs without mpich

Based on these benchmarks openmpi has a longer latency than mvapich and impi. The difference in bandwidth is less significant. These benchmarks did not show any significant difference between impi and mvapich, or any significant difference between intel and gcc compilation.

Graphs with mpich

Based on these benchmarks mpich has a latency an order of magnitude longer than the other modules, and a bandwidth two orders of magnitude lower. This is likely due to the lack of psm2 support: because mpich was not compiled with this library, it cannot utilize the Intel® Omni-Path Architecture for MPI communication.


Comparison Between Scratch Volumes

At the time these tests were run, scratch and scratch2 were configured differently, and users had reported certain MPI versions performing differently on the two volumes. Scratch uses a RAID 6 array with NFS served over IPoIB. Scratch2 uses lvm-cache with an NVMe drive in front of a RAID 6 array, with NFS over RDMA.

The same benchmarks (point-to-point latency, bandwidth, multi-pair latency, and bidirectional bandwidth) were run to compare performance between the two scratch volumes. All benchmarks were run 10 times for each module. The graphs plot the average of these runs, and error bars denote one standard deviation. A subset of all MPI modules was tested:

  • openmpi-2.0/gcc : openmpi version 2.0 compiled with gcc version 4.8.5 with psm2 libraries enabled

  • mvapich2-2.2/gcc : mvapich2 version 2.2 compiled with gcc version 4.8.5

  • mvapich2-2.2-psm/gcc : mvapich2 version 2.2 compiled with gcc version 4.8.5 with psm2 libraries enabled

  • impi : intel mpi compiled with icc version 17.0.1 with psm2 libraries enabled

There is not a significant difference between the results from the two scratch volumes. This is likely because the benchmarks are not I/O intensive enough to exhibit any performance difference.
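
For context, these point-to-point benchmarks only exchange messages in memory and never touch the filesystem. Something along the lines of the following MPI-IO sketch, which times a parallel write to a file on a scratch volume, would be needed to actually stress the storage; the file path and write size are hypothetical placeholders, not part of the benchmark suite used above.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of an I/O-intensive MPI test: each rank writes BLOCK bytes to its
 * own offset in a shared file and rank 0 reports the elapsed time.
 * The file path below is a hypothetical placeholder for a scratch volume. */
#define BLOCK (64 * 1024 * 1024)   /* 64 MiB per rank (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double t0;
    char *buf;                      /* contents are irrelevant for timing */
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    buf = malloc(BLOCK);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    MPI_File_open(MPI_COMM_WORLD, "/scratch/iotest.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK, MPI_BYTE,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Barrier(MPI_COMM_WORLD);
    if (rank == 0)
        printf("wrote %.0f MiB in %.2f s\n",
               (double)size * BLOCK / (1024.0 * 1024.0), MPI_Wtime() - t0);

    free(buf);
    MPI_Finalize();
    return 0;
}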