The GREENE cluster uses Intel CPUs, so your code can benefit from the Intel distribution of Python. Look at Intel's benchmarks to see whether this would be useful for you.
The Intel distribution of Python, together with the daal4py library, may make your ML code much faster!
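As a rough sketch of how this is typically used (this assumes the installed daal4py build provides the scikit-learn patching API, daal4py.sklearn.patch_sklearn; check your version):

# Sketch only: assumes daal4py's scikit-learn patching is available in your build.
from daal4py.sklearn import patch_sklearn
patch_sklearn()  # replaces supported scikit-learn estimators with accelerated versions

import numpy as np
from sklearn.cluster import KMeans  # import after patching

X = np.random.rand(100_000, 10)
KMeans(n_clusters=8, n_init=10).fit(X)  # now runs on the accelerated backend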
In order to experiment with it on the cluster, use
module use /share/apps/intel/19.1.2/intelpython3/bin
Note: the command below will show the available environment modules for Python built with the Intel compilers. However, this is not the Intel distribution of Python.
module avail python/intel/3.8.6
You have several options to make your computations parallel:
Independent tasks
If you can separate your large job into several independent tasks, do that and submit a corresponding number of jobs.
Grid Search across parameter space
If you can, run the grid search as independent jobs on separate nodes (see the Python sketch after this list).
ML/Data processing algorithms supporting multiple CPUs on single node
ML/Data processing algorithms supporting multiple CPUs on multiple nodes
This usually requires the algorithm to be compiled with MPI support.
Splitting data
Some algorithms support splitting the data: computations are performed on independent chunks of data and the results are then combined.
Splitting features
Some algorithms support splitting by features: computations are performed on independent subsets of features and the results are then combined.
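For example, a grid search can be split across a SLURM job array so that each array task fits one parameter combination on its own node, while the estimator uses the CPUs allocated to that task. The sketch below is purely illustrative (the file name, parameter grid, and estimator are our own choices, not from the cluster documentation):

# grid_point.py -- submit as a job array, e.g. sbatch --array=0-5 ...
import os
from itertools import product

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Full parameter grid; each array task evaluates exactly one combination.
grid = list(product([100, 200, 400], [None, 10]))          # (n_estimators, max_depth)
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))  # which combination this task handles
n_estimators, max_depth = grid[task_id]

X, y = load_iris(return_X_y=True)

# Use only the CPUs SLURM allocated to this task on a single node.
n_cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))
clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, n_jobs=n_cpus)

score = cross_val_score(clf, X, y, cv=3).mean()
print(f"task {task_id}: n_estimators={n_estimators}, max_depth={max_depth}, score={score:.3f}")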
It is always good to look at the options listed in the official documentation: here and here.
If you use Python thread-based parallelism, note that Python uses the GIL (Global Interpreter Lock): at any one time only a single thread can hold the lock and execute Python code. Threads share global objects, but CPU-bound work will not run in parallel; execution simply jumps from one thread to another.
If you use Python multiprocessing, there is no shared access to global variables at all: objects are pickled and passed to the independent worker processes.
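A minimal illustration of process-based parallelism (the task function and sizes are our own, purely illustrative):

import os
from multiprocessing import Pool

def cpu_bound_task(n):
    # Toy CPU-bound work: sum of squares up to n.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Use the number of CPUs SLURM allocated to this job, falling back to 1.
    n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

    inputs = [10_000_000] * n_workers
    # Each input is pickled and sent to a separate worker process;
    # workers do not share the parent's global variables.
    with Pool(processes=n_workers) as pool:
        results = pool.map(cpu_bound_task, inputs)

    print(results)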
There are other tools addressing the parallelization task, such as Ray, mpi4py, and dask-mpi. We have examples for a couple of options you may find useful:
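These are not the cluster's own examples, but for a flavor of multi-node parallelism from Python, a minimal mpi4py sketch looks like this (launched with srun or mpirun; the file name is our own):

# hello_mpi.py -- run with e.g. srun python hello_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's index
size = comm.Get_size()   # total number of MPI processes

# Each rank could work on its own slice of the data here.
print(f"Rank {rank} of {size} is running")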
A well-written overview of parallelization options for R is presented here: https://www.glennklockwood.com/data-intensive/r/
Note: it is important to tell R the correct number of cores, matching what you requested from SLURM.
A wrong way: asking R for every core it can see (for example with detectCores() from the parallel package).
No matter how many cores you requested from SLURM, this will instruct R to use all the cores on the node (20, 28, or even 40). A job that requests a certain number of CPUs but uses more or fewer cores than requested will be detected by the monitoring tools and killed.
An acceptable method: read the number of cores actually allocated to your job from the SLURM_CPUS_PER_TASK environment variable (Sys.getenv("SLURM_CPUS_PER_TASK") in R).
Also set OMP_NUM_THREADS to 1. Some R algorithms will still try to use more cores if OMP_NUM_THREADS is not set. Thus, please do
export OMP_NUM_THREADS=1
in your SLURM batch script, or equivalently
Sys.setenv(OMP_NUM_THREADS = "1")  # inside R script

Spark provides an alternative way to distribute computations across multiple CPUs and multiple nodes. Spark comes with Spark ML - a collection of ML algorithms accessible from pyspark and sparklyr.
You can read more information here: Spark, MLlib, PySpark, sparklyr
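For a flavor of what Spark ML code looks like from Python, here is a minimal pyspark sketch (the toy data and column names are our own, purely illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("spark-ml-example").getOrCreate()

# Toy DataFrame; in practice you would read data with spark.read.csv(...) or similar.
df = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.5, 1.0), (0.2, 0.9, 0.0), (0.9, 0.1, 1.0)],
    ["x1", "x2", "label"],
)
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(df)

# Spark distributes the fitting work across the executors of the Spark cluster.
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
print(model.coefficients)

spark.stop()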
Some may also find useful the algorithms provided by H2O, which can be run with Sparkling Water (H2O on top of Spark): Sparkling Water
Example: Spark interactive: Scala, Python, R