CUDA
CUDA (Compute Unified Device Architecture) is a general-purpose parallel computing architecture developed by NVIDIA. It consists of the CUDA Instruction Set Architecture (ISA) and the parallel compute engine in the NVIDIA GPU (Graphics Processing Unit). The GPU has hundreds of cores that can collectively run thousands of computing threads. This capability complements a conventional CPU's strength at serial tasks: the CPU runs the serial portions of an application, hands off parallel subtasks to the GPU, and manages the complete set of tasks that make up the overall algorithm. In this model of computing, the best results are generally obtained by minimizing the communication between the CPU (host) and the GPU (device).
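To make the host/device split concrete, here is a minimal vector-addition sketch (file and variable names are illustrative, not part of the installed examples): the CPU performs the serial setup, the GPU runs the parallel loop across thousands of threads, and data crosses the host-device boundary only twice.

// vecadd.cu -- illustrative sketch of the host/device model described above
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each GPU thread adds one pair of elements in parallel.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Serial portion: the CPU allocates and initializes the inputs.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = i; h_b[i] = 2.0f * i; }

    // One transfer in ...
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // ... the parallel subtask runs on the GPU ...
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // ... and one transfer out.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[123] = %f\n", h_c[123]);  // expect 369.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}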
Important Notes
A GPU job may not work as expected if the SLURM job submission flags omit "--gres=gpu:<n>", where n is 1 to 8.
Specify the number of GPUs (1 to 8) according to your requirements. Most applications can use only one GPU. If you request multiple GPUs with --gres=gpu:X, make sure your job actually uses all the GPUs in the node; otherwise use --gres=gpu:1.
GPU jobs can be run on the GPU nodes available in the gpu queue (refer to HPC Resource View).
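Inside a running job, you can verify which GPUs SLURM has allocated by inspecting the CUDA_VISIBLE_DEVICES environment variable, which SLURM sets on most GPU configurations when --gres=gpu:<n> is given:
echo $CUDA_VISIBLE_DEVICES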
Installed Versions
All the available versions of CUDA can be viewed by issuing the following command (this applies to other applications as well):
module spider CUDA
output:
...
CUDA/12.1.1
CUDA/12.2.0
CUDA/12.3.0
Load the module:
module load CUDA/<version>
Running GPU jobs
Interactive job
For available GPU queues/partitions, visit HPC Resource View. To access a node with an L40S GPU card:
srun -p gpu -C gpul40s --gres=gpu:1 -N 1 -n 2 --time=1:00:00 --mem=5gb --pty /bin/bash
Note: This requests a gpul40s node type (-C gpul40s) from the gpu partition (-p gpu) with one GPU (--gres=gpu:1), two tasks on one node (-N 1 -n 2), 5 GB of memory (--mem=5gb), and a one-hour time limit (--time=1:00:00).
Load the CUDA module:
module load CUDA/12.3.0
Run the deviceQuery executable:
deviceQuery
You should see information about the CUDA device(s). Details such as the number of cores and multiprocessors might be useful during CUDA programming.
output:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA L40S"
CUDA Driver Version / Runtime Version 12.6 / 12.3
CUDA Capability Major/Minor version number: 8.9
Total amount of global memory: 45488 MBytes (47697362944 bytes)
...
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.6, CUDA Runtime Version = 12.3, NumDevs = 1, Device0 = NVIDIA L40S
Result = PASS
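The same properties can be queried from your own code with the CUDA runtime API. A minimal sketch (hypothetical file name props.cu), which you can compile with nvcc as shown in the compilation section below:

// props.cu -- query the properties deviceQuery reports, from your own code
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);  // number of GPUs visible to this job
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  Global memory:      %zu MBytes\n",
               prop.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}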
You can also run the "nvidia-smi" command:
nvidia-smi
output:
Tue Jan 28 12:54:44 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S On | 00000000:CA:00.0 Off | 0 |
| N/A 24C P8 31W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Note: GPU-Util is 0% because no jobs are running on the GPU; when GPU jobs are running, you will see nonzero values.
If your application supports communication between two GPU cards in a node, use:
srun --x11 -p gpu --gres=gpu:2 --pty /bin/bash
If you forget to load the CUDA module, the CUDA libraries and binaries will not be found, and you will encounter an error such as:
./deviceQuery: command not found
Compiling CUDA Code
Request a gpu node
srun --x11 -p gpu --gres=gpu:1 --pty /bin/bash
Copy the hello.cu file from /usr/local/doc/CUDA:
cp /usr/local/doc/CUDA/hello.cu .
Load the CUDA module:
module load CUDA/12.3.0
Compile:
nvcc hello.cu -o hello
Execute:
./hello
output:
Hello World!
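The hello.cu shipped in /usr/local/doc/CUDA is not reproduced here; a typical CUDA hello-world that produces this output looks roughly like the following sketch (the installed file may differ):

// hello.cu -- minimal CUDA hello-world, illustrative only
#include <cstdio>

// Kernel executed on the GPU; device-side printf is supported on
// all current GPUs.
__global__ void hello() {
    printf("Hello World!\n");
}

int main() {
    hello<<<1, 1>>>();        // launch a single GPU thread
    cudaDeviceSynchronize();  // wait for the kernel (and its printf) to finish
    return 0;
}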
Batch job
Use this SLURM script, job.sh:
#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH -p gpu --gres=gpu:1
#SBATCH -N 1 -n 6
#SBATCH -o cuda_test.o%j

module load CUDA/12.3.0
deviceQuery
Submit the job:
sbatch job.sh
You should obtain the same output as above in the cuda_test.o<jobid> file in your working directory.
GPU Benchmark
To learn about the performance of different GPUs, see GPU Benchmark @ HPC.
GPU Compute Modes
There are also different GPU compute modes, as shown below:
Default : Multiple threads from multiple processes can run on this GPU
Exclusive Thread : Only one thread in one process can run on this GPU
Prohibited : No threads are allowed to run on this GPU
Exclusive Process : Only one process can use this GPU, but many threads within that process can run on it
By default, a GPU is in Default mode, and multiple users can submit jobs to the same GPU node.
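You can check the compute mode of the GPU assigned to your job programmatically via the CUDA runtime API; a small sketch (hypothetical file name computemode.cu):

// computemode.cu -- report which compute mode the allocated GPU is in
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 as seen by this job
    switch (prop.computeMode) {
        case cudaComputeModeDefault:          printf("Default\n");           break;
        case cudaComputeModeProhibited:       printf("Prohibited\n");        break;
        case cudaComputeModeExclusiveProcess: printf("Exclusive Process\n"); break;
        default:                              printf("Exclusive Thread\n");  break;
    }
    return 0;
}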