Containerized Applications
Running TensorFlow in HPC
Copy the tensorflow files to your home directory and cd to it:
cp -r /usr/local/doc/SINGULARITY/singularity/tensorflow .
cd tensorflow
Interactive job submission
Request a GPU node with 8 GB of memory:
srun -p gpu -C gpup100 --gres=gpu:1 --mem=8gb --pty bash
Load the Singularity module
module load singularity
Run the Python matrix multiplication code:
singularity exec -B /scratch --nv $TENSORFLOW python log-device-placement.py
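For reference, log-device-placement.py follows the classic TensorFlow 1.x device-placement example and looks roughly like the sketch below; the course copy may differ.

import tensorflow as tf

# Two constant matrices; log_device_placement=True makes TensorFlow
# report the device (CPU or GPU) that each operation is placed on.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print(sess.run(c))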
Output:
2019-05-07 13:54:21.959086: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-07 13:54:22.117470: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P100-PCIE-12GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:03:00.0
totalMemory: 11.91GiB freeMemory: 11.63GiB
2019-05-07 13:54:22.117517: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-05-07 13:54:22.714286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-07 13:54:22.714336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-05-07 13:54:22.714345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
...
2019-05-07 13:54:22.716203: I tensorflow/core/common_runtime/placer.cc:927] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
[49. 64.]]
Batch job submission
Find the tensor.slurm job file in the tensorflow directory and submit the job:
sbatch tensor.slurm
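For reference, tensor.slurm is likely similar to the following sketch, which reproduces the interactive steps above; the provided file may differ.

#!/bin/bash
#SBATCH -p gpu
#SBATCH -C gpup100
#SBATCH --gres=gpu:1
#SBATCH --mem=8gb

module load singularity
singularity exec -B /scratch --nv $TENSORFLOW python log-device-placement.py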
Check the output file:
cat slurm-<jobid>.out
You should see the same output as in the interactive run.
Running RAPIDS in HPC
RAPIDS accelerates the complete data science pipeline, from data ingestion and manipulation to machine learning training.
It utilizes NVIDIA CUDA and exposes GPU parallelism and high memory bandwidth through user-friendly Python interfaces modeled on pandas, scikit-learn, and similar libraries.
With Apache Spark or Dask, RAPIDS can scale out to multi-node, multi-GPU clusters.
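As a minimal sketch of that drop-in style (assuming a working RAPIDS environment, such as the container used below), cuDF can stand in for pandas with the data held in GPU memory:

import cudf

# cuDF mirrors the pandas DataFrame API; the data and the
# groupby computation live on the GPU.
df = cudf.DataFrame({"key": ["a", "b", "a", "b"],
                     "val": [1.0, 2.0, 3.0, 4.0]})
print(df.groupby("key").mean())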
Scikit-learn vs. cuML using RAPIDS
Access the Markov Desktop (Interactive Apps) from ondemand.case.edu (with a GPU allocation)
Get the Jupyter notebook “kmeans_demo.ipynb” from the RAPIDS AI GitHub page, or copy it from /usr/local/doc/SINGULARITY/singularity/rapids:
cp /usr/local/doc/SINGULARITY/singularity/rapids/kmeans_demo.ipynb .
Load the Singularity module and (optionally) download the latest RAPIDS AI container
module load singularity
(Optional) If you want a more recent version of the image than the existing one, pull the container. Make sure you use storage space other than your home directory to avoid a quota violation:
singularity pull docker://rapidsai/rapidsai:latest
Add this environment variable (variables prefixed with SINGULARITYENV_ are passed into the container; TINI_SUBREAPER lets the container's tini init process reap children even when it is not PID 1):
export SINGULARITYENV_TINI_SUBREAPER=1
Run the RAPIDS AI container.
Its path is provided in the environment variable $RAPIDSAI (check with "module display singularity"):
singularity run --nv -B /mnt $RAPIDSAI
Open Jupyter Lab
/conda/envs/rapids/bin/jupyter-lab --allow-root --ip=0.0.0.0 &
You will be prompted to copy and paste one of the URLs.
Open the Firefox browser on the same node and type (or simply paste the link address of) one of the URLs into the browser:
http://classt01:8889/?token=xxxx
Start executing the Python commands in the Jupyter notebook.
Although both methods find the same centroids (within a threshold), cuML runs much faster.
The graph shows the results: blue filled circles for scikit-learn and red circles for cuML.
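The heart of the comparison can be sketched as follows (a hypothetical, simplified version; the notebook's actual data sizes and plotting code differ):

import time
import numpy as np
from sklearn.cluster import KMeans as skKMeans
from sklearn.datasets import make_blobs
from cuml.cluster import KMeans as cuKMeans

# Synthetic clustering problem generated on the host (sizes are illustrative).
X, _ = make_blobs(n_samples=100000, n_features=2, centers=5, random_state=0)
X = X.astype(np.float32)

# CPU: scikit-learn
t0 = time.time()
sk_model = skKMeans(n_clusters=5, random_state=0).fit(X)
print("scikit-learn: %.2f s" % (time.time() - t0))

# GPU: cuML (accepts host NumPy arrays and copies them to the GPU)
t0 = time.time()
cu_model = cuKMeans(n_clusters=5, random_state=0).fit(X)
print("cuML:         %.2f s" % (time.time() - t0))

# Both sets of centroids should agree within a small tolerance.
print(sk_model.cluster_centers_)
print(cu_model.cluster_centers_)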