CUDA/GPU

CUDA

CUDA (Compute Unified Device Architecture) was developed by NVIDIA a general purpose parallel computing architecture. It consists of CUDA Instruction Set Architecture (ISA) and parallel compute engine in the NVIDIA GPU (Graphics Processing Unit). The GPU has hundreds of cores that can collectively run thousands of computing threads. This capability complements the ability of a conventional CPU to run serial tasks by permitting the CPU to run the serial portions of an application, to handoff to the GPU parallel subtasks and to manage the complete set of tasks that make up the overall algorithm. Generally, in this model of computing, the best results are obtained my minimizing the communication between CPU (host) and the GPU (device).

Important Notes

- GPU job may not work as expected if the SLURM job submission flags miss "--gres=gpu:<n>", where n is either 1, 2, 3 or 4 as the GPU nodes have either 2 or 4 GPU cards in each server. Please check HPC Resource View.

Specify the number of gpus (1, 2, 3 or 4) according to your requirement. Most of the applications can use only one GPU. Also, you should be sure that your script (job) is actually using multiple GPUs in the node with --gres=gpu:2,3, or 4 else use --gres=gpu:1.
- GPU jobs can be run on different sets of GPU nodes that are available in gpu queue (refer to HPC Resource View) . Please check the GPU info using deviceQuery and select the GPU group (e.g -C gpuk40, gpup100, gpu2080, gpu2v100, gpu4v100 etc) that your job is compatible with.
Want to start from basic C++ using acc pragmas - OpenACC? vist this site. Also, find the GPU-capable Software in HPC Software Guide.
GPU2080 GPU nodes gput[045-052] do have SSD drive. Please use /tmp ($TMPDIR) as scratch space to use SSDs.
We have limited number of V100 nodes, please limit your number to 1-2 nodes at a time.
Check CUDNN version compatibility (support matrix) at https://docs.nvidia.com/deeplearning/cudnn/support-matrix/. Also, check the compatible CUDA compute capability in GPU node for your application.

Installed Versions

All the available versions of CUDA for use can be viewed by issuing the following command. This applies for other applications as well.

module avail CUDA

output:

CUDA/11.3.1

CUDA/11.4.1

CUDA/11.5.1

CUDA/11.6.0

CUDA/11.7.0

CUDA/12.0.0

The default version is identified by "(D)" behind the module name and can be loaded as:

module load CUDA

The other versions of CUDA can be loaded as:

module load CUDA/<version>

Running GPU jobs

Interactive job

For available GPU queues/partition and features, visit HPC Resource View. To access nodes in "gpu2080" feature, type:

srun -p gpu -C gpu2080 --gres=gpu:1 -N 1 -n 2 --time=1:00:00 --mem=5gb --pty /bin/bash # fpr Pioneer

srun -p markov_gpu --gres=gpu:1 -N 1 -n 2 --time=1:00:00 --mem=5gb --pty /bin/bash

Note: This will request a gpu2080 set of nodes in a gpu queue (-p gpu) with only one gpu (--gres=gpu:1) out of 2 gpus with memory 5gb (--mem=5gb). If you want to access Volta 100, use -C gpu2v100 or gpu4v100.

Load the cuda module. (load the latest version of cuda module for "gpu" queue)

module load CUDA/12.0

To run a deviceQuery executable, please do the following steps:Run the executable:

deviceQuery

You should be able to get information about CUDA. This information such as number of cores, number of multiprocessors, Compute Capability etc might be useful during CUDA programming. GPU jobs can be run on different sets of GPU nodes that are available in gpu queue (refer to HPC Resource View) . Please select the GPU group (e.g -C gpuk40, gpup100, gpu2080 etc) that your job is compatible with.

This output is from GPU GeForce RTX 2080 Ti ( -p gpu -C gpu2080)

$ deviceQuery

.deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce RTX 2080 Ti"

CUDA Driver Version / Runtime Version 10.2 / 8.0

CUDA Capability Major/Minor version number: 7.5

Total amount of global memory: 11019 MBytes (11554717696 bytes)

MapSMtoCores for SM 7.5 is undefined. Default to use 128 Cores/SM

(68) Multiprocessors, (128) CUDA Cores/MP: 8704 CUDA Cores

GPU Max Clock rate: 1545 MHz (1.54 GHz)

Memory Clock rate: 7000 Mhz

Memory Bus Width: 352-bit

L2 Cache Size: 5767168 bytes

Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)

Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers

Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 49152 bytes

Total number of registers available per block: 65536

Warp size: 32

Maximum number of threads per multiprocessor: 1024

Maximum number of threads per block: 1024

Max dimension size of a thread block (x,y,z): (1024, 1024, 64)

Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)

Maximum memory pitch: 2147483647 bytes

Texture alignment: 512 bytes

Concurrent copy and kernel execution: Yes with 3 copy engine(s)

Run time limit on kernels: No

Integrated GPU sharing Host Memory: No

Support host page-locked memory mapping: Yes

Alignment requirement for Surfaces: Yes

Device has ECC support: Disabled

Device supports Unified Addressing (UVA): Yes

Device PCI Domain ID / Bus ID / location ID: 0 / 2 / 0

Compute Mode:

< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce RTX 2080 Ti

Result = PASS

You can also run "nvidia-smi" command (use -l flag for looping i.e. -l 1 for every second):

$ nvidia-smi

Wed Apr 29 10:44:19 2020

+-----------------------------------------------------------------------------+

| NVIDIA-SMI 440.64 Driver Version: 440.64 CUDA Version: 10.2 |

|-------------------------------+----------------------+----------------------+

| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |

| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |

|===============================+======================+======================|

| 0 GeForce RTX 208... Off | 00000000:02:00.0 Off | N/A |

| 30% 31C P0 54W / 250W | 0MiB / 11019MiB | 0% Default |

+-------------------------------+----------------------+----------------------+

| 1 GeForce RTX 208... Off | 00000000:81:00.0 Off | N/A |

| 36% 36C P0 30W / 250W | 0MiB / 11019MiB | 0% Default |

+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+

| Processes: GPU Memory |

| GPU PID Type Process name Usage |

|=============================================================================|

| No running processes found |

+-----------------------------------------------------------------------------+

Note: GPU Util in both the GPUs are 0%. If the GPU jobs are running, you will see some values. Also check the driver version: 440.64

If your application support communication among both the GPUs in a node, use:

srun --x11 -p gpu -C gpu2080 --gres=gpu:2 --pty /bin/bash

If you forget to load the cuda module, you won’t be able to load cuda libraries and encounter an error as follows:

./deviceQuery: error while loading shared libraries: libcudart.so.3: cannot open shared object file: No such file or directory

Compiling Cuda Code

Request a gpu node

srun --x11 -p gpu -C gpu2080 --gres=gpu:1 --pty /bin/bash

Copy hello.cu file from /usr/local/doc/CUDA

cp /usr/local/doc/CUDA/hello.cu .

Load the cuda module

module load cuda

Compile:

nvcc hello.cu -o hello

Execute:

./hello

output:

Hello World!

Batch job

Use this slurm script: job.sh

#SBATCH --time=10:00:00

#SBATCH -p gpu -C gpuk40 --gres=gpu:1

#SBATCH -N 1 -n 6

#SBATCH -o cuda_test.o%j

module load cuda

deviceQuery

Submit the job:

sbatch <job.sh>

You should be able to obtain the same results as above in cuda_test.o<jobid> file in your working directory.

If your application support communication among multiple GPUs in a node, use:

#SBATCH -N 1 -n 1 -p gpu -C gpu2080 --gres=gpu:2 # RTX 2080 Ti

#SBATCH -N 1 -n 1 -p gpu -C gpu4v100 --gres=gpu:4 # Volta 100

GPU Benchmark

To know about the performance of different GPUs (refer to HPC Resource View), click on GPU Benchmark @ HPC.

GPU Compute Modes

Notes:

Shared Mode is not working now. So, use --gres=gpu:x.
The compute mode is set to 0 (the default mode)

There are also different GPU compute modes as showed:

Default : Multiple threads can run on this GPU
Exclusive Thread : Only one thread in one process can run on this GPU
Prohibited : No threads are allowed to run on this GPU
Exclusive Process : Many threads in one process will be able to run on this GPU

nvidia-smi -h

output:

-c, --compute-mode Set MODE for compute applications: 0|DEFAULT, 1|EXCLUSIVE_THREAD, 2|PROHIBITED, 3|EXCLUSIVE_PROCESS

By default, GPU will be in an exclusive process mode. If you want to run mutliple jobs (e.g. GUP_task_ 1 - *_5) in a single GPU, you need to use the shared keyword as showed:

#SBATCH -N 1 -n 1 -p gpu -C gpuk40 --gres=gpu:2:shared

GPU_task_1 &

GPU_task_2 &

GPU_task_3 &

GPU_task_4 &

GPU_task_5

exit

The output form nvidia-smi command showed the multiple threads running as showed:

Warning:

If two jobs, each requesting one gpu, are simultaneously assigned to the same node (less than 10 sec delay between two jobs), the second job may be terminated with the following error without affecting the first job:

{^H??? Error using ==> gpu_entry

src/cuda/context.cpp:361: CUDA driver error: invalid device (101)

References

Flags: CUDA_ARCH, or -arch, may be called for when compiling code with CUDA. In addition, 'gencodes' may be helpful. The gpuk40 partition has Nvidia K40 gpus, requiring designation SM35 for architecture. Google searches using keywords including 'gencode' and 'arch' may provide more specific information, as well as instructions to compile specific software for user installation on the HPC.