Node Interleaving Benchmarks

May 2018 Edition

Node Interleaving: what the hell?

I am using my Dual Xeon 6130 setup (Skylake-SP) with 12x 32GB RAM sticks.

The number of sockets (and the placement of the RAM sticks) defines how memory is spread across the CPUs and their physical cores. Each socket together with its local RAM is called a "NUMA node".


A 2S (2-socket) Skylake-SP setup has:

  • 2 CPUs (processors), each linked to 1 RAM bank
  • 2 RAM banks of 12 DDR4 sticks each (6 channels of limited bandwidth per bank, 2 sticks per channel), for a maximum of 768GB RAM
  • UPI links (with limited bandwidth) bridging the two CPUs
  • Each CPU has direct access to its own 12 DDR4 sticks and indirect access to the other CPU's 12 sticks, but its memory bandwidth is already about 97% saturated with just 6 DDR4 sticks (one per channel)


In my scenario, I already reach that ~97% saturation: I have only 12 DDR4 sticks, 6 per CPU.

A BIOS setting called "Node Interleaving" lets you control how memory is spread across the nodes, and can prove useful for programming languages that do not manage memory placement themselves (R without Rcpp, Python, etc.):

  • Node Interleaving Enabled = Uniform Memory Access
  • Node Interleaving Disabled = Non-Uniform Memory Access (the default setting in virtually every existing motherboard)
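
As a quick sanity check, the mode actually in effect can be read back from a running Linux system: with Node Interleaving disabled, this 2S setup reports two NUMA nodes; with it enabled, it reports a single node covering all the RAM. A minimal sketch, assuming the numactl package is installed:

# Read back the NUMA topology the OS sees (run from an R session on Linux).
# Node Interleaving disabled (NUMA): a 2S machine reports 2 nodes.
# Node Interleaving enabled (UMA): a single node covering all the RAM.
system("numactl --hardware")
system("lscpu | grep -i numa")   # alternative check via lscpu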

NUMA (Non-Uniform Memory Access): Expert Mode

UMA (Uniform Memory Access): Bulletproof Mode

Node Interleaving disabled (NUMA) means:

  • If no placement is specified, memory is allocated on the first RAM bank available (good luck getting it to default to the 2nd CPU's bank)
  • If the requested memory sits on the RAM bank local to the CPU that needs it, access has the lowest possible latency
  • If the requested memory sits on the other CPU's RAM bank, access incurs a latency penalty (it has to cross the UPI)


On Skylake-SP, the metrics are approximately the following:

  • On the local (right) RAM bank: ~85 nanoseconds, ~100 GB/s
  • On the remote (wrong) RAM bank: ~135 nanoseconds, ~34.5 GB/s


Running with NUMA means you must know very well how to control memory allocation (numactl on Linux) and how to handle CPU affinity; see the sketch below.

As the number of NUMA nodes (number of sockets) increases, the performance without memory allocation optimization decreases significantly.
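
For illustration, here is a minimal sketch of what that manual placement can look like on Linux, assuming numactl is available; the node numbers and the idea of launching R under numactl are assumptions for this example, not the exact benchmark setup:

# From a Linux shell, bind both the CPU and the memory of an R session
# to NUMA node 0 before it starts (assumed example):
#   numactl --cpunodebind=0 --membind=0 R --vanilla
#
# A softer variant prefers node 0 but allows spilling over to the other node:
#   numactl --preferred=0 R --vanilla
#
# Per-process round-robin interleaving across all nodes is also possible,
# mimicking the BIOS "Node Interleaving Enabled" behaviour for that process only:
#   numactl --interleave=all R --vanilla
#
# From inside an already running R session, check the current policy with:
system("numactl --show")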

Node Interleaving enabled (UMA) means:

  • Memory is allocated in a round-robin fashion across the NUMA nodes: it just goes... everywhere
  • Having memory in the "wrong" place no longer matters: in theory, half of it is already in the right place and half of it is in the wrong place
  • It "averages" the latency and bandwidth penalties of NUMA


On Skylake-SP, the metrics are approximately the following:

  • On any bank: ~118 nanoseconds, ~125 GB/s


Running with UMA means you do not have to deal with memory allocation at all, as memory going anywhere is fine.

As the number of NUMA nodes (number of sockets) increases, the performance theoretically remains invariant.

Example: what happens to the performance if you manually allocate the memory to the right or the wrong place?

Local vs remote NUMA binding: floating point workloads

From: NUMA Best Practices for Dell PowerEdge 12th Generation Servers

AMD Ryzen / Threadripper / EPYC CPUs, although they look like a single CPU, actually behave like multiple sockets (multiple NUMA nodes). The same is true for Intel Xeons with Sub NUMA Clustering enabled (it is disabled by default, for obvious performance optimization reasons).

Examples:

  • AMD Threadripper 1950X is a single CPU, dual socket processor (2x 8 physical cores)
  • AMD EPYC 7401p is a single CPU, quad socket processor (4x 6 physical cores)
  • Two AMD EPYC 7601 make a dual CPU, eight socket configuration (8x 8 physical cores)
  • Intel Xeon Gold 6130 with Sub NUMA Clustering is a single CPU, dual socket processor (2x 8 physical cores)
  • Intel Xeon Gold 6130 without Sub NUMA Clustering is a single CPU, single socket processor (1x 16 physical cores)
  • Two Intel Xeon Gold 6130 without Sub NUMA Clustering make a dual CPU, dual socket configuration (2x 16 physical cores)

Datasets Used

Higgs

  • 11,000,000 training observations (all)
  • 406 features (the 28 original features plus the 378 multiplicative interactions of all their pairwise combinations; see the sketch after this list)
  • Metric: time taken to finish 10 boosting iterations


  • 4,466,000,000 elements
  • 665,073,059 zero values (14.89% sparsity)
  • 35,728,026,424 bytes (33.3GB)
  • Peak RAM requirement: 82GB
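
To make the feature construction concrete, here is a minimal sketch of how 28 base features expand into 406 columns (the 28 originals plus their 378 pairwise products); the function name and the dummy data are illustrative only, not the exact preprocessing code used for the benchmark:

make_interactions <- function(x) {
  # x: numeric matrix with the 28 base feature columns
  pairs <- combn(ncol(x), 2)                        # all 378 unordered column pairs
  interactions <- x[, pairs[1, ]] * x[, pairs[2, ]] # element-wise pairwise products
  cbind(x, interactions)                            # 28 + 378 = 406 columns
}

# Quick shape check on dummy data (not the Higgs data)
dummy <- matrix(rnorm(100 * 28), ncol = 28)
dim(make_interactions(dummy))                       # 100 x 406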

Servers Used (Hardware / Software)

1-64 thread runs

  • CPU: Dual Xeon Gold 6130 (3.7/2.8 GHz, 32c/64t)
  • RAM: 384GB DDR4 2666MHz (12x 32GB)
  • OS: Windows 10 Enterprise + Windows Subsystem for Linux, Ubuntu 16.04
  • Virtualization: None
  • R 3.5, compiled
  • WSL: gcc 5.4
  • BIOS Setting: node interleaving either Enabled or Disabled

Gradient Boosted Trees Algorithms Used

xgboost

  • Versions used: commit 8f6aadd
  • Flags used:
    • gcc: -O3 -mtune=native

LightGBM

  • Versions used: commit 3f54429
  • Flags used:
    • gcc: -O3 -mtune=native

Installation of Gradient Boosted Trees Algorithms

xgboost

Installing xgboost directly from R:

devtools::install_github("Laurae2/xgbdl")
xgbdl::xgb.dl(compiler = "gcc", commit = "8f6aadd", use_avx = TRUE, use_gpu = FALSE)

LightGBM

Installing LightGBM directly from R:

devtools::install_github("Laurae2/lgbdl")
lgbdl::lgb.dl(commit = "3f54429", compiler = "gcc")

Hyperparameters Used (Full list)

xgboost

Hyperparameters, average of 5 runs (approximately 48h):

  • Depth: 8
  • Leaves: 255
  • Hessian: 1
  • Minimum Loss to split: 0
  • Column Sampling: 100%
  • Row Sampling: 100%
  • Iterations: 10
  • Learning Rate: 0.25
  • Boosting method: gbdt, fast histogram
  • Bins: 255
  • Loss function: binary:logistic


Note: the timing takes into account the binning construction time, which is approximately 50% to 70% of the xgboost timing.

It takes 13 minutes with 1 thread, 2 minutes with 64 threads.
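
For reference, a hypothetical xgb.train call matching the hyperparameters above; dtrain (an xgb.DMatrix holding the Higgs data) and the fixed thread count are assumptions for this sketch, not the exact benchmark script:

library(xgboost)
params <- list(
  tree_method = "hist",            # gbdt, fast histogram
  max_depth = 8,
  max_leaves = 255,
  min_child_weight = 1,            # Hessian
  gamma = 0,                       # minimum loss to split
  colsample_bytree = 1,            # column sampling: 100%
  subsample = 1,                   # row sampling: 100%
  eta = 0.25,                      # learning rate
  max_bin = 255,
  objective = "binary:logistic",
  nthread = 64                     # varied from 1 to 64 in the benchmark
)
model <- xgb.train(params = params, data = dtrain, nrounds = 10)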

LightGBM

Hyperparameters, average of 5 runs (approximately 14h):

  • Depth: 8
  • Leaves: 255
  • Hessian: 1
  • Minimum Loss to split: 0
  • Column Sampling: 100%
  • Row Sampling: 100%
  • Iterations: 10
  • Learning Rate: 0.25
  • Boosting method: gbdt
  • Bins: 255
  • Loss function: binary


Note: the timing does not take into account the binning construction time.

It takes 16 minutes using 1 thread, 23 seconds using 64 threads.
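
Similarly, a hypothetical lgb.train call matching the hyperparameters above; dtrain (an lgb.Dataset holding the Higgs data) and the fixed thread count are assumptions for this sketch:

library(lightgbm)
params <- list(
  objective = "binary",
  boosting = "gbdt",
  max_depth = 8,
  num_leaves = 255,
  min_sum_hessian_in_leaf = 1,     # Hessian
  min_gain_to_split = 0,           # minimum loss to split
  feature_fraction = 1,            # column sampling: 100%
  bagging_fraction = 1,            # row sampling: 100%
  learning_rate = 0.25,
  max_bin = 255,
  num_threads = 64                 # varied from 1 to 64 in the benchmark
)
model <- lgb.train(params = params, data = dtrain, nrounds = 10)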

Performance Analysis (Node Interleaving)

Use the Performance Analysis if you want to compare timing data.

Check interactively on Tableau Public:

https://public.tableau.com/views/NodeInterleavingv1/Dashboard?:bootstrapWhenNotified=true&:display_count=y&:display_static_image=y&:embed=y&:showVizHome=no&publish=yes

Provided dynamic and interactive filters:

  • Threads