xgboost vs LightGBM Benchmarks

October 2018 Edition

Yet another benchmark?

Here we go for another benchmark!

This time, we are using more recent versions of xgboost and LightGBM, together with a new bazooka of a server!

For those interested, I am keeping the explanation of node interleaving (NUMA vs UMA) below.

The server itself sits in the top tier of benchmarking leaderboards, as the Cinebench results below show.

Specifications:

  • CPU: Dual Xeon Gold 6154:
    • 2x 18 cores / 36 threads
    • 3.7 GHz all turbo (133.2 GHz of raw power...)
    • 3.3 GHz all turbo AVX-256
    • 2.7 GHz all turbo AVX-512
  • RAM: 256 GB 2666 MHz (4x 64GB)

Cinebench R11.5 (rank 2 world, 11/18/2018)

60.83 cb, rank 1 in the 36-core category

Faster than an i9-7980XE overclocked at 5.85 GHz!

Cinebench R15 (rank 33 world, 11/18/2018)

6323 cb, rank 1 in the 36-core category

Faster than a quad Intel Xeon E7-4890 v2 setup! (60c/120t, 3.4/2.8 GHz)

Node Interleaving: what the hell?

I am using my Dual Xeon 6154 setup (Skylake-SP) with 4x 64GB RAM sticks.

The number of sockets (and the RAM stick placement) defines how memory is spread across the CPUs and physical cores. We call each socket's associated RAM a "NUMA node".


A 2S (2-socket) Skylake-SP setup has:

  • 2 CPUs (processors), each linked to its own RAM bank
  • 2 banks of 12 DDR4 sticks (6 channels of limited bandwidth, 2 sticks per channel), for a maximum of 768GB RAM
  • UPIs (with limited bandwidth) bridging the two CPUs
  • Each CPU has direct access to its own 12 DDR4 sticks and indirect access to the other 12, but the memory bandwidth is already 97% saturated with just 6 DDR4 sticks per CPU


In my scenario, I reach only 34% saturation: I have just 4 DDR4 sticks, 2 per CPU.

A BIOS setting called "Node Interleaving" controls how memory is allocated across the RAM banks, and can prove useful for programming languages that do not manage memory placement themselves (R without Rcpp, Python, etc.):

  • Node Interleaving Enabled = Uniform Memory Access (UMA)
  • Node Interleaving Disabled = Non-Uniform Memory Access (NUMA, the default setting on virtually every motherboard)

NUMA (Non-Uniform Memory Access): expert mode.

UMA (Uniform Memory Access): bulletproof mode.

Node Interleaving disabled (NUMA) means:

  • Memory is allocated, if not placed explicitly, on the first RAM bank available (good luck getting it to default to the 2nd CPU)
  • If the requested memory sits on the RAM bank local to the CPU, it gets the lowest possible latency
  • If the requested memory sits on the remote RAM bank, it incurs a latency cost (the request must cross the UPI)


On Skylake-SP, the metrics are approximately the following:

  • On the local RAM bank: 85 nanoseconds, 100 GBps
  • On the remote RAM bank: 135 nanoseconds, 34.5 GBps


Running with NUMA means you must know very well how to allocate memory (numactl in Linux) and how to handle CPU affinity.

As the number of NUMA nodes (number of sockets) increases, the performance without memory allocation optimization decreases significantly.
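
For instance, a minimal sketch with numactl on Linux (benchmark.R is a hypothetical script name, and the node ids depend on your topology):

numactl --hardware
numactl --cpunodebind=0 --membind=0 Rscript benchmark.R
numactl --interleave=all Rscript benchmark.R

The first command inspects the NUMA topology and per-node free memory, the second pins both the process and its allocations to node 0, and the third interleaves allocations across all nodes, emulating UMA without touching the BIOS.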

Node Interleaving enabled (UMA) means:

  • Memory is allocated in a round-robin fashion across the NUMA nodes: allocations just go everywhere, one node after the other
  • Where the memory lands does not matter: in theory, half of it is in the right place and half of it in the wrong place
  • It "averages" the latency and bandwidth penalties of NUMA


On Skylake-SP, the metrics are approximately the following:

  • On any RAM bank: 118 nanoseconds, 125 GBps


Having UMA means you do not have to deal with memory allocation, as memory going anywhere is fine.

As the number of NUMA nodes (sockets) increases, performance theoretically remains invariant.

Example: what happens to performance if you manually allocate memory to the right or the wrong place?

[Chart: local vs. remote NUMA binding for floating-point workloads. From: NUMA Best Practices for Dell PowerEdge 12th Generation Servers]

AMD Ryzen / Threadripper / EPYC CPUs, although they look like a single CPU, actually behave as multiple sockets (multiple NUMA nodes). The same is true for Intel Xeons with Sub-NUMA Clustering enabled (disabled by default, for obvious performance optimization reasons).

Examples:

  • AMD Threadripper 1950X is a single CPU, dual socket processor (2x 8 physical cores)
  • AMD EPYC 7401P is a single CPU, quad socket processor (4x 6 physical cores)
  • Two AMD EPYC 7601s are a dual CPU, eight socket setup (8x 8 physical cores)
  • Intel Xeon Gold 6130 with Sub-NUMA Clustering is a single CPU, dual socket processor (2x 8 physical cores)
  • Intel Xeon Gold 6130 without Sub-NUMA Clustering is a single CPU, single socket processor (1x 16 physical cores)
  • Two Intel Xeon Gold 6130s without Sub-NUMA Clustering are a dual CPU, dual socket setup (2x 16 physical cores)

Effect of Compiler

gcc vs icc

Unfortunately, this time we will not be testing the Intel compiler (icc) on the boosting libraries. Keep in mind icc is usually faster, though only by a very small margin after all the enhancements made to gcc since version 4.9.

However, for benchmarking, Intel compilers are usually the fastest (are benchmarks actually optimized for them?)... Below are the commands used to compile STREAM (stream.c), the standard memory bandwidth benchmark, with each compiler.

gcc 8.2.0

gcc -fopenmp -DNTIMES=1000 stream.c -O3 -march=native -o stream_gcc
./stream_gcc

icc 19.0.0.117 (gcc 8.2.0 compatibility)

icc -qopenmp -DNTIMES=1000 stream.c -O3 -xCORE-AVX512 -o stream_icc
./stream_icc

Datasets Used

Bosch

  • 1,183,747 training observations (the full dataset)
  • 477 features (up to 95% sparse; see the loading sketch after this list)
  • Metrics:
    • Time taken to finish 50 boosting iterations
    • Time taken from the end of the first boosting iteration to the end of the 50th
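
Both libraries consume the data through their own containers before training. A minimal loading sketch, assuming X is the 1,183,747 x 477 sparse feature matrix (a dgCMatrix) and y the binary labels (these names are mine, not the author's script):

library(Matrix)
library(xgboost)
library(lightgbm)

# X: sparse dgCMatrix of features, y: 0/1 labels (assumed names)
dtrain_xgb <- xgb.DMatrix(data = X, label = y)
dtrain_lgb <- lgb.Dataset(data = X, label = y)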

Servers Used (Hardware / Software)

1-72 thread runs

  • CPU: Dual Xeon Gold 6154 (3.7/3.7 GHz, 36c/72t)
  • RAM: 256GB 2666MHz (4x 64GB)
  • OS: Pop!_OS 18.10
  • Kernel parameters: pti=off spectre_v2=off spec_store_bypass_disable=off l1tf=off noibrs noibpb nopti no_stf_barrier
  • Virtualization: None
  • R 3.5.1, compiled
  • Compiler: gcc 8.2.0
  • BIOS Setting: node interleaving either Enabled or Disabled

Gradient Boosted Trees Algorithms Used

xgboost

  • Version used: commit e26b5d6
  • Flags used:
    • gcc: -O3 -mtune=native

LightGBM

  • Version used: commit 3ad9cba
  • Flags used:
    • gcc: -O3 -mtune=native

Installation of Gradient Boosted Trees Algorithms

xgboost

Installing xgboost directly from R:

devtools::install_github("Laurae2/xgbdl")
xgbdl::xgb.dl(compiler = "gcc", commit = "e26b5d6", use_avx = FALSE, use_gpu = FALSE)

LightGBM

Installing LightGBM directly from R:

devtools::install_github("Laurae2/lgbdl")
lgbdl::lgb.dl(commit = "3ad9cba", compiler = "gcc")

Hyperparameters Used (Full list)

xgboost

Hyperparameters, average of 7 runs (approximately 162h):

  • Depth: 8
  • Leaves: 255
  • Hessian: 1
  • Minimum Loss to split: 0
  • Column Sampling: 100%
  • Row Sampling: 100%
  • Iterations: 50
  • Learning Rate: 0.10
  • Boosting method: exact, approx, hist
  • Bins: 255
  • Loss function: binary:logistic


Note: we report both 0 to 50 iterations and 1 to 50 iterations, so that one-time setup costs incurred in the first iteration can be separated out for comparison purposes.
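
For reference, a minimal sketch of how these settings map onto the R API, using dtrain_xgb from the Bosch sketch above (nthread = 72 and the subtraction-based timing are my assumptions, not necessarily the original script):

library(xgboost)

params <- list(objective = "binary:logistic",
               tree_method = "hist",   # "exact" / "approx" / "hist" per run
               max_depth = 8,
               max_leaves = 255,       # used by the hist method
               min_child_weight = 1,   # Hessian
               gamma = 0,              # minimum loss to split
               colsample_bytree = 1,   # column sampling
               subsample = 1,          # row sampling
               eta = 0.10,             # learning rate
               max_bin = 255,          # used by the hist method
               nthread = 72)           # assumption: one thread per logical core

# One way to separate "0 to 50" from "1 to 50": time a 1-iteration run and a
# 50-iteration run, then subtract, removing setup and first-iteration costs
time_run <- function(n) {
  t0 <- proc.time()[["elapsed"]]
  xgb.train(params = params, data = dtrain_xgb, nrounds = n)
  proc.time()[["elapsed"]] - t0
}
time_0_to_50 <- time_run(50)
time_1_to_50 <- time_0_to_50 - time_run(1)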

LightGBM

Hyperparameters, average of 7 runs (approximately 4h):

  • Depth: 8
  • Leaves: 255
  • Hessian: 1
  • Minimum Loss to split: 0
  • Column Sampling: 100%
  • Row Sampling: 100%
  • Iterations: 50
  • Learning Rate: 0.10
  • Boosting method: gbdt
  • Bins: 255
  • Loss function: binary


Note: we report both 0 to 50 iterations and 1 to 50 iterations, so that one-time setup costs incurred in the first iteration can be separated out for comparison purposes.
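
The equivalent LightGBM sketch, with the same caveats (dtrain_lgb from the Bosch sketch above; depending on the LightGBM version, max_bin may instead need to be set when constructing the lgb.Dataset):

library(lightgbm)

params <- list(objective = "binary",
               boosting = "gbdt",
               max_depth = 8,
               num_leaves = 255,
               min_sum_hessian_in_leaf = 1,  # Hessian
               min_gain_to_split = 0,        # minimum loss to split
               feature_fraction = 1,         # column sampling
               bagging_fraction = 1,         # row sampling
               learning_rate = 0.10,
               max_bin = 255,
               num_threads = 72)             # assumption, as above

model <- lgb.train(params = params, data = dtrain_lgb, nrounds = 50)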

Performance Analysis (Node Interleaving)

Use this Performance Analysis if you want to compare timing data.

Check interactively on Tableau Public:

https://public.tableau.com/views/xgboostvsLightGBMspeedOct2018/ModelCentric-1toNiterations?%3Aembed=y&%3AshowVizHome=no&publish=yes

Provided dynamic and interactive filters:

  • Threads
  • Model
  • Method
  • Memory