Benchmark Your Cluster using Intel Distribution for LINPACK Benchmark

Introduction

The TOP500 project ranks and details the 500 most powerful non-distributed computer systems in the world. Have you ever wanted to evaluate the speed and performance of your CPU and compare its power to others? This post explains, step by step, how to benchmark the performance of your machine.

To do so, you can use LINPACK to benchmark your PC, cluster, supercomputer, or any other CPU-based system. LINPACK is a software library for performing numerical linear algebra on digital computers. It was written in Fortran by Jack Dongarra. LINPACK has mostly been used to test supercomputers, whose best performance is compared and ranked by TOP500, the most famous HPC benchmark ranking website. Both Windows and Linux users can download Intel's High Performance LINPACK (HPL) implementations, called the Intel Optimized LINPACK Benchmark and the Intel Distribution for LINPACK Benchmark, from the Intel Math Kernel Library (MKL). The MKL benchmark and developer guide can be found here.

The goal of this benchmark is to measure the performance of an Intel CPU by solving a dense system of linear equations, with the result measured in FLOPS, such as GFlop/s or TFlop/s. The theoretical peak performance in GFlop/s is calculated with the following formula:

Node performance in GFlops = (CPU speed in GHz) x (number of CPU cores)
x (CPU instruction per cycle) x (number of CPUs per node) 
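As a worked example, the formula above can be evaluated with a one-liner. This sketch uses the Xeon node figures quoted later in section 4.2 (12 physical cores at 2.4 GHz, 16 floating-point operations per cycle, one node):

```shell
# Theoretical peak = GHz x cores x FLOPs/cycle x nodes
awk 'BEGIN { printf "%.1f GFlop/s\n", 2.4 * 12 * 16 * 1 }'
```

This matches the Rpeak of 460.8 GFlop/s reported for that node in section 4.2.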

Intel Distribution for LINPACK Benchmark measures the amount of time it takes to factor and solve a random dense system of linear equations (Ax=b) in real*8 precision. All benchmarks presented here are based on the Intel Distribution for LINPACK Benchmark library, which can run on multiple compute nodes or on one compute node with multiple MPI processes.

Intel Xeon Scalable Series, one of the most efficient Intel processors.

Summit: the fastest computer in the world (1st TOP500)

Content

1. Preparation and Prerequisites

1.1 Intel LINPACK Benchmark for SMP and MPI processing

There are two versions of the Intel MKL LINPACK benchmark: shared-memory and distributed-memory.

  1. Intel® Optimized LINPACK Benchmark refers to the shared-memory version. This is a compiled-code benchmark, found in the linpack directory. It supports only pure OpenMP parallelism, so please turn off Intel Hyper-Threading technology. Details can be found here.
  2. Intel® Distribution for LINPACK Benchmark refers to the distributed-memory version (using an MPI library). This version can be found in the mp_linpack directory.

1.2 Get Intel MKL Benchmark

1. Go to https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite

2. Browse the package for the OS you are using; packages are available for Linux, Windows, and macOS. Then download it to your machine.

For example, I used the following command to download the l_mklb_p_2018.3.011.tgz file to my Linux machine.

wget http://registrationcenter-download.intel.com/akdlm/irc_nas/9752/l_mklb_p_2018.3.011.tgz

For Windows and macOS, click the download link and save the zip file to your PC, probably in your Downloads folder.

3. Uncompress the archive; the binaries are under a sub-directory called benchmarks_2018. For example,

  • Linux
/home/rangsiman/intel/compilers_and_libraries_2018.3.222/linux/mkl/benchmarks
  • Windows
C:\Users\rangs\Downloads\w_mklb_p_2018.3.011\benchmarks_2018\windows\mkl\benchmarks

1.3 My Machine Specification

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Stepping:              2
CPU MHz:               2400.221
BogoMIPS:              4799.29
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14,16,18,20,22
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15,17,19,21,23
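From this lscpu output, the physical core count (which the shared-memory benchmark uses by default) is Socket(s) x Core(s) per socket, and Hyper-Threading doubles that into logical CPUs. A quick arithmetic check with the numbers above:

```shell
# 2 sockets x 6 cores/socket = 12 physical cores;
# x 2 threads/core = 24 logical CPUs (matching "CPU(s): 24")
awk 'BEGIN { printf "%d physical cores, %d logical CPUs\n", 2 * 6, 2 * 6 * 2 }'
```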

2. Let's Benchmark Your Machine!

2.1 Use LINPACK: Shared-memory version

1. Browse to the linpack directory. For example,

cd /home/rangsiman/intel/compilers_and_libraries_2018.3.222/linux/mkl/benchmarks/linpack

2. Run LINPACK

./runme_xeon64

Please note that the number of threads defaults to the number of physical cores. If you want to set the number of threads yourself, set OMP_NUM_THREADS before running the script. For example, on a 64-bit system:

export OMP_NUM_THREADS=4

You can view and change the default number of tests and their problem sizes by modifying the input file, called lininput_xeon64:

vi lininput_xeon64

3. Wait until the evaluation process is completed. Benchmark results are written to a text file called lin_xeon64.txt.


2.2 Benchmark Results for Shared-memory version

Open the lin_xeon64.txt file using a text editor.

Sample output (run with the input file lininput_xeon64):

CPU frequency:    3.199 GHz
Number of CPUs: 2
Number of cores: 12
Number of threads: 12

Parameters are set to:

Number of tests: 15
Number of equations to solve (problem size) : 1000  2000  5000  10000 15000 18000 20000 22000 25000 26000 27000 30000 35000 40000 45000
Leading dimension of array                  : 1000  2000  5008  10000 15000 18008 20016 22008 25000 26000 27000 30000 35000 40000 45000
Number of trials to run                     : 4     2     2     2     2     2     2     2     2     2     1     1     1     1     1
Data alignment value (in Kbytes)            : 4     4     4     4     4     4     4     4     4     4     4     1     1     1     1

Maximum memory requested that can be used=16200901024, at the size=45000

=================== Timing linear equation system solver ===================

Size   LDA    Align. Time(s)    GFlops   Residual     Residual(norm) Check
1000   1000   4      0.024      27.2993  9.394430e-13 3.203742e-02   pass
1000   1000   4      0.006      107.6577 9.394430e-13 3.203742e-02   pass
...
content skipped
...
35000  35000  1      99.141     288.3358 1.275258e-09 3.701880e-02   pass
40000  40000  1      146.370    291.5210 1.516881e-09 3.373595e-02   pass
45000  45000  1      213.964    283.9451 2.008430e-09 3.533621e-02   pass

Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
1000   1000   4       86.9113  107.6577
...
content skipped
...
40000  40000  1       291.5210 291.5210
45000  45000  1       283.9451 283.9451

Residual checks PASSED

The Intel Optimized LINPACK benchmark (shared-memory version) reports that the Rmax of my Xeon node is 283.945 GFlop/s.

This benchmark was performed with a maximum problem size of 45000, which is the default setting. The parameters can be adjusted to improve the benchmark results.
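As a sanity check, the GFlop/s figures can be reproduced from the standard LINPACK operation count, 2/3*N^3 + 2*N^2 floating-point operations, divided by the wall time. Using the last row of the timing table (N = 45000, 213.964 s):

```shell
# LINPACK flop count for one solve: 2/3*N^3 + 2*N^2
awk 'BEGIN { n = 45000; t = 213.964
  printf "%.1f GFlops\n", (2/3 * n^3 + 2 * n^2) / t / 1e9 }'
```

The result agrees with the 283.9451 GFlops in the table to within rounding of the printed time.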

3.1 Use LINPACK: Distributed-memory version

1. Browse to the mp_linpack directory, where the LINPACK binary built with the MPI library is available. For example,

cd /home/rangsiman/intel/compilers_and_libraries_2018.3.222/linux/mkl/benchmarks/mp_linpack

2. Before running the LINPACK benchmark on a distributed-memory system, you first need to source the environment scripts of the Intel compiler and the MPI library:

source <parent directory>/bin/compilervars.sh intel64
source <mpi directory>/bin64/mpivars.sh

If you have not installed Intel Parallel Studio XE, which provides both the Intel compiler and the MPI library, read this post.

3. In the HPL.dat file, set the problem size (N) to 10000 (line 6).

10000           Ns

where N (the problem size) is the matrix size at which performance is measured.

For a real performance test, Intel suggests choosing N so that the matrix occupies about 80% of memory, which usually yields the highest cluster performance. N can therefore be estimated using the following formula:

sqrt(( Memory Size in Gbytes * 1024 * 1024 * 1024 * Number of Nodes) /8 ) * 0.80

For example, in my case, total memory is 32 GB and free memory is 28 GB.

The following is to compute N based on total memory

sqrt(( 32 * 1024 * 1024 * 1024 * 1 ) /8) * 0.80 
= 52428 (~ 53000)

The following is to compute N based on free memory

sqrt(( 28 * 1024 * 1024 * 1024 * 1 ) /8) * 0.80 
= 49042 (~ 50000)
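The arithmetic above can be reproduced with a one-liner; for example, for the 32 GB total-memory case:

```shell
# N ~ sqrt(mem_bytes * nodes / 8 bytes per double) * 0.80
awk 'BEGIN { printf "N = %d\n", sqrt(32 * 1024^3 * 1 / 8) * 0.80 }'
```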

4. The Ps and Qs parameters in the HPL.dat file are the numbers of rows and columns in the process grid, respectively.

Note that Ps should be less than or equal to Qs. In addition, Ps * Qs must equal the number of MPI processes you specify in step 6. For example, I set the number of MPI processes to 12, so Ps and Qs in HPL.dat can be set as follows:

1            # of process grids (P x Q)
3            Ps
4            Qs

or (alternatively)

1            # of process grids (P x Q)
2            Ps
6            Qs

or (alternatively)

2            # of process grids (P x Q)
2 3          Ps
6 4          Qs
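All valid grids must satisfy Ps * Qs = 12 with Ps <= Qs. A short sketch to enumerate the candidate grids for any process count:

```shell
# Enumerate (P, Q) pairs with P <= Q and P*Q = 12 MPI processes
awk 'BEGIN { n = 12
  for (p = 1; p * p <= n; p++)
    if (n % p == 0) printf "P=%d Q=%d\n", p, n / p }'
```

For 12 processes this yields the grids 1x12, 2x6, and 3x4; grids closer to square (here 3x4) are generally preferred.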

5. Set the swapping threshold on line 27 of HPL.dat to 64; otherwise leave it at the default.

64           swapping threshold

6. Set MPI_PROC_NUM and MPI_PER_NODE in runme_intel64_dynamic script.

  • MPI_PROC_NUM is the total number of MPI processes
  • MPI_PER_NODE is the number of MPI processes per node; this parameter should equal 1 or the number of sockets in your system. Use the lscpu command to check how many sockets your machine has.

For example,

export MPI_PROC_NUM=12
export MPI_PER_NODE=2

Then execute the runme_intel64_dynamic script:

./runme_intel64_dynamic

7. Check the output file:

less xhpl_intel64_dynamic_outputs.txt


3.2 Benchmark Results for Distributed-memory version

The portion below is example output from the LINPACK benchmark on my single node.

================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
...
content skipped
...
N        :   50000
NB       :     192
PMAP     : Column-major process mapping
P        :       3
Q        :       4
PFACT    :   Right
NBMIN    :       2
NDIV     :       2
RFACT    :   Crout
BCAST    :   1ring
DEPTH    :       0
SWAP     : Binary-exchange
L1       : no-transposed form
U        : no-transposed form
EQUIL    : no
ALIGN    :    8 double precision words
--------------------------------------------------------------------------------

....
content skipped
....
castor          : Column=029760 Fraction=0.595 Kernel=301530.44 Mflops=359513.53
castor          : Column=030912 Fraction=0.615 Kernel=296401.39 Mflops=358633.49
castor          : Column=031872 Fraction=0.635 Kernel=288518.77 Mflops=357777.99
castor          : Column=032832 Fraction=0.655 Kernel=275671.09 Mflops=357173.61
castor          : Column=033792 Fraction=0.675 Kernel=261768.88 Mflops=356300.48
castor          : Column=034752 Fraction=0.695 Kernel=251046.82 Mflops=355550.94
castor          : Column=039936 Fraction=0.795 Kernel=216555.89 Mflops=350960.15
castor          : Column=044928 Fraction=0.895 Kernel=142837.25 Mflops=347347.11
castor          : Column=049920 Fraction=0.995 Kernel=60197.71 Mflops=345636.08
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WC00C2R2       50000   192     3     4             249.69            3.33764e+02
HPL_pdgesv() start time Sat Jul 14 14:28:08 2018

HPL_pdgesv() end time   Sat Jul 14 14:32:17 2018

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0033573 ...... PASSED
================================================================================

4. Comparing Benchmark Results with the SUMMIT Supercomputer

(for Distributed-memory version)

4.1 LINPACK Benchmark of the SUMMIT supercomputer, ORNL, USA (#1 on TOP500)

Spec: IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband

  • CPU cores = 2,282,544
  • Rmax = 122,300.0 TFlop/s
  • Rpeak = 187,659.3 TFlop/s


4.2 LINPACK Benchmark of my Intel Xeon node, Chalawan head node, NARIT, Thailand

Spec: Intel® Xeon® E5 v3 processor, 24 CPU cores @ 2.4 GHz (with Hyper-Threading), 32 GB RAM

LINPACK configuration details

  • CPU cores = 12
  • Rmax = 333.764 GFlop/s = 0.334 TFlop/s
  • Rpeak = 460.8 GFlop/s = 0.461 TFlop/s (12 physical CPU cores x 2.4 GHz x 16 operations/cycle)


4.3 Efficiency

Another widely used parameter for determining HPC performance is efficiency, which is the ratio of Rmax to Rpeak (Rmax/Rpeak).

Efficiency of SUMMIT is ( 122,300.0 / 187,659.3 ) x 100 = 65.17 %

Efficiency of my Xeon node is ( 333.764 / 460.8 ) x 100 = 72.43 %
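These two percentages can be checked directly:

```shell
# Efficiency = Rmax / Rpeak x 100, using the values from 4.1 and 4.2
awk 'BEGIN { printf "SUMMIT: %.2f %%\n",    122300.0 / 187659.3 * 100
             printf "Xeon node: %.2f %%\n", 333.764  / 460.8    * 100 }'
```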


4.4 Improve Your Performance Benchmark

To achieve more reliable benchmark results, I strongly suggest reading this article to further enhance the benchmark performance of your cluster with MPI.

5. Use LINPACK on Windows OS

5.1 Instruction

1. Restart your PC/laptop and close all programs before running the benchmark test.

2. Navigate to the uncompressed LINPACK folder, for example,

C:\Users\rangs\Downloads\w_mklb_p_2018.3.011\benchmarks_2018\windows\mkl\benchmarks\linpack

3. Double-click the LINPACK bat file to run the test:

runme_xeon64.bat

A CMD window then appears; wait until the process is completed. It can take several minutes. Results are written to the win_xeon64.txt file.

5.2 My Laptop Spec

Windows 10 Pro (64-bit), Intel Core i7-4750HQ CPU @ 2.00 GHz, 8 GB RAM (7.89 GB usable)

  • CPU frequency: 3.192 GHz
  • Number of CPUs: 1
  • Number of cores: 4
  • Number of threads: 4


5.3 Benchmark Results of my Laptop

  • Rmax : N/A
  • Rpeak: N/A

6. LINPACK on Mobile Device

The LINPACK benchmark is not only available for Linux and Windows platforms; you can also test the performance of your mobile phone using LINPACK.

Download a LINPACK benchmark app from the app store for your phone's OS.


Android: Mobile Linpack

  • Xiaomi Redmi Note 3
  • Android 6.0.1
  • Number of cores: 6
  • CPU Frequency: 1.40 GHz
  • RAM size 2.78 GB


iOS: N/A

7. References

Disclaimer:

  • All Intel packages are run under the End User License Agreement.
  • All Intel packages and programs presented in this post/website are used under a qualified student license.


Rangsiman Ketkaew