I am using my Dual Xeon 6130 setup (Skylake-SP) with 12x 32GB RAM sticks.
The number of sockets (and the placement of the RAM sticks) defines how memory is spread across the CPUs and their physical cores. We call each socket's associated RAM a "NUMA node".
A 2S (2-socket) Skylake-SP setup has:
In my scenario, I already reach that 97% saturation: I have only 12 DDR4 sticks, 6 per CPU.
A BIOS setting called "Node Interleaving" lets you control how memory is allocated across nodes, and can prove useful for programming languages that do not manage memory placement themselves (R without Rcpp, Python, etc.):
Node Interleaving disabled (NUMA) means:
On Skylake-SP, the metrics are approximately the following:
Having NUMA means you must know very well how to allocate memory (numactl in Linux) and how to handle CPU affinity; see the numactl sketch below.
As the number of NUMA nodes (number of sockets) increases, the performance without memory allocation optimization decreases significantly.
Node Interleaving enabled (UMA) means:
On Skylake-SP, the metrics are approximately the following:
Having UMA means you do not have to deal with memory allocation, as memory can go anywhere.
As the number of NUMA nodes (number of sockets) increases, the performance theoretically remains invariant.
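To make this concrete, here is a minimal sketch, assuming a Linux system with the numactl package installed (the node number and script name are illustrative): it checks how many NUMA nodes the BIOS exposes and shows how an R process would be pinned to a single node.

# Minimal sketch, assuming Linux with the numactl package installed.
# With Node Interleaving disabled (NUMA), a 2S setup should report 2 nodes;
# with Node Interleaving enabled (UMA), it should report a single node.
system("numactl --hardware")
# Show the NUMA policy of the current R process.
system("numactl --show")
# Pinning has to happen before R starts, e.g. from a shell (illustrative script name):
#   numactl --cpunodebind=0 --membind=0 Rscript my_benchmark.R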
Example: what happens to performance if you manually allocate memory to the right or the wrong place?
Local vs remote NUMA binding: floating point workloads
From:
NUMA Best Practices for Dell PowerEdge 12th Generation Servers
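To reproduce this kind of local vs remote comparison yourself, here is a minimal sketch, assuming Linux with numactl (the node numbers and the fp_bench.R file name are illustrative): the same memory-bandwidth-bound floating point workload is launched once with local memory and once with remote memory.

# Contents of fp_bench.R (illustrative): a memory-bandwidth-bound floating point workload.
set.seed(1)
x <- matrix(rnorm(8000 * 8000), nrow = 8000)      # about 512 MB, far larger than the caches
v <- rnorm(8000)
print(system.time(for (i in 1:50) y <- x %*% v))  # streams the full matrix 50 times
# Launch it twice from a shell and compare the timings:
#   local memory : numactl --cpunodebind=0 --membind=0 Rscript fp_bench.R
#   remote memory: numactl --cpunodebind=0 --membind=1 Rscript fp_bench.R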
AMD Ryzen / Threadripper / EPYC CPUs, although they appear to be a single CPU, are internally built from multiple modules and can behave like multiple NUMA nodes. The same is true for Intel Xeons with Sub-NUMA Clustering enabled (it is disabled by default, since most software is not NUMA-aware).
Examples:
xgboost: commit 8f6aadd, compiled with -O3 -mtune=native
LightGBM: commit 3f54429, compiled with -O3 -mtune=native
Installing xgboost directly from R:
devtools::install_github("Laurae2/xgbdl")
xgbdl::xgb.dl(compiler = "gcc", commit = "8f6aadd", use_avx = TRUE, use_gpu = FALSE)
Installing LightGBM directly from R:
devtools::install_github("Laurae2/lgbdl")
lgbdl::lgb.dl(commit = "3f54429", compiler = "gcc")
Hyperparameters, average of 5 runs (approximately 48h):
Note: the timing takes into account the binning construction time, which is approximately 50% to 70% of the xgboost timing.
It takes 13 minutes with 1 thread, 2 minutes with 64 threads.
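For reference, a hedged sketch of such a training call follows: the data and hyperparameters below are illustrative placeholders (the actual ones are listed above), and only tree_method and nthread matter for the scaling comparison.

# Illustrative sketch only: dummy data and placeholder hyperparameters.
library(xgboost)
set.seed(1)
x <- matrix(rnorm(10000 * 50), ncol = 50)
y <- as.numeric(rowSums(x[, 1:5]) > 0)
dtrain <- xgb.DMatrix(data = x, label = y)
params <- list(
  objective = "binary:logistic",
  tree_method = "hist",  # fast histogram method; binning is redone at training time
  nthread = 64           # set to 1 or 64 to reproduce the 13 min vs 2 min comparison
)
model <- xgb.train(params = params, data = dtrain, nrounds = 100)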
Hyperparameters, average of 5 runs (approximately 14h):
Note: the timing does not take into account the binning construction time.
It takes 16 minutes using 1 thread, 23 seconds using 64 threads.
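The LightGBM equivalent, again as an illustrative sketch with dummy data and placeholder hyperparameters: binning is tied to the lgb.Dataset constructed before training, which is why it is excluded from the timings above.

# Illustrative sketch only: dummy data and placeholder hyperparameters.
library(lightgbm)
set.seed(1)
x <- matrix(rnorm(10000 * 50), ncol = 50)
y <- as.numeric(rowSums(x[, 1:5]) > 0)
dtrain <- lgb.Dataset(data = x, label = y)  # binning is handled here, not in lgb.train
params <- list(
  objective = "binary",
  num_threads = 64       # set to 1 or 64 to reproduce the 16 min vs 23 s comparison
)
model <- lgb.train(params = params, data = dtrain, nrounds = 100)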
Use the Performance Analysis if you want to compare timing data.
Check interactively on Tableau Public:
The following dynamic and interactive filters are provided: