Here we go for another benchmark!
This time, we will be using more recent versions of xgboost and LightGBM, but also a new bazooka of a server!
I am keeping the explanation about node interleaving (NUMA vs UMA) below, for those interested.
In case you are interested, the server scores in the top tier of benchmark leaderboards.
Specifications:
- 60.83 cb, rank 1 in the 36-core category: faster than an i9-7980XE overclocked at 5.85 GHz!
- 6323 cb, rank 1 in the 36-core category: faster than a quad Intel Xeon E7-4890 v2 (60c/120t, 3.4/2.8 GHz)!
I am using my Dual Xeon 6154 setup (Skylake-SP) with 4x 64GB RAM sticks.
The number of sockets (and the RAM stick placement) defines how memory is spread across the CPUs and physical cores. Each socket's associated RAM is called a "NUMA node".
A 2S (2-socket) Skylake-SP setup has two NUMA nodes, each CPU providing 6 memory channels (12 in total).
In my scenario, I reach only about 34% saturation: I have only 4 DDR4 sticks, 2 per CPU (2 of the 6 channels populated per socket).
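If you want to inspect your own NUMA topology on Linux, here is a minimal sketch (it assumes the numactl package is installed):

# List the NUMA nodes, their CPUs, memory sizes, and inter-node distances
numactl --hardware
# lscpu also summarizes the socket / core / NUMA node mapping
lscpu | grep -i numa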
A BIOS setting called "Node Interleaving" allows you to force how memory is allocated, and can prove useful for programming languages that do not assign memory placement themselves (R without Rcpp, Python, etc.):
Node Interleaving disabled (NUMA) means:
On Skylake-SP, the metrics are approximately the following:
Having NUMA means you must know very well how to allocate memory (numactl in Linux; see the sketch further below) and how to handle CPU affinity.
As the number of NUMA nodes (sockets) increases, performance without memory-allocation optimization degrades significantly.
Node Interleaving enabled (UMA) means:
On Skylake-SP, the metrics are approximately the following:
Having UMA means you do not have to deal with memory allocation, as memory going anywhere is fine.
As the number of NUMA nodes (sockets) increases, performance theoretically remains invariant.
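Even without touching the BIOS, you can approximate UMA per process on Linux; a minimal sketch, reusing the stream_gcc binary compiled further below:

# Interleave memory allocations round-robin across all NUMA nodes,
# approximating UMA behavior for this process only
numactl --interleave=all ./stream_gcc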
Example: what happens to performance if you manually allocate memory to the right or the wrong place?
Local vs remote NUMA binding: floating-point workloads (from: NUMA Best Practices for Dell PowerEdge 12th Generation Servers)
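Such a comparison can be reproduced with numactl; a minimal sketch, assuming a 2-socket machine and the stream_gcc binary compiled further below:

# Local binding: run on node 0's cores, allocate on node 0's memory
numactl --cpunodebind=0 --membind=0 ./stream_gcc
# Remote binding: run on node 0's cores, but allocate on node 1's memory
# (bandwidth drops and latency rises compared to the local case)
numactl --cpunodebind=0 --membind=1 ./stream_gcc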
AMD Ryzen / Threadripper / EPYC CPUs, although they look like a single CPU, actually behave like multiple sockets. The same is true for Intel Xeons with Sub-NUMA Clustering enabled (it is disabled by default, for obvious performance optimization reasons).
Examples:
Unfortunately, this time we will not be testing the Intel compiler (icc) for xgboost and LightGBM. Keep in mind icc is usually faster, although by a very small margin after all the enhancements made to gcc since version 4.9.
However, for benchmarking tools, Intel compilers are usually the fastest (are the benchmarks actually optimized for them?)...
gcc -fopenmp -DNTIMES=1000 stream.c -O3 -march=native -o stream_gcc
./stream_gcc
icc -qopenmp -DNTIMES=1000 stream.c -O3 -xCORE-AVX512 -o stream_icc
./stream_icc
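On a NUMA machine, it also helps to pin the OpenMP threads explicitly; a small sketch, assuming the 36 physical cores of this Dual Xeon 6154:

# Spread and bind the OpenMP threads across the cores before running STREAM
OMP_NUM_THREADS=36 OMP_PROC_BIND=spread ./stream_gcc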
Commits and compilation flags used:

xgboost: commit e26b5d6, compiled with -O3 -mtune=native
LightGBM: commit 3ad9cba, compiled with -O3 -mtune=native
Installing xgboost directly from R:
devtools::install_github("Laurae2/xgbdl")
xgbdl::xgb.dl(compiler = "gcc", commit = "e26b5d6", use_avx = FALSE, use_gpu = FALSE)
Installing LightGBM directly from R:
devtools::install_github("Laurae2/lgbdl")
lgbdl::lgb.dl(commit = "3ad9cba", compiler = "gcc")
Hyperparameters, average of 7 runs (approximately 162h):
Note: we separate the 0-to-50 and the 1-to-50 iteration ranges for comparison purposes (the very first iteration typically carries one-time setup overhead).
Hyperparameters, average of 7 runs (approximately 4h):
Note: we separate the 0-to-50 and the 1-to-50 iteration ranges for comparison purposes (the very first iteration typically carries one-time setup overhead).
Use the Performance Analysis view if you want to compare timing data.
Explore it interactively on Tableau Public:
Dynamic and interactive filters provided: