CUDA
CUDA (Compute Unified Device Architecture) was developed by NVIDIA as a general-purpose parallel computing architecture. It consists of the CUDA Instruction Set Architecture (ISA) and the parallel compute engine in the NVIDIA GPU (Graphics Processing Unit). The GPU has hundreds of cores that can collectively run thousands of computing threads. This capability complements the ability of a conventional CPU to run serial tasks: the CPU runs the serial portions of an application, hands off parallel subtasks to the GPU, and manages the complete set of tasks that make up the overall algorithm. Generally, in this model of computing, the best results are obtained by minimizing the communication between the CPU (host) and the GPU (device).
Important Notes
A GPU job may not work as expected if the Slurm job submission flags omit "--gres=gpu:<n>", where n is either 1 or 2, since the GPU nodes have 2 GPU cards.
Specify the number of GPUs (1 or 2) according to your requirement. Most applications can use only one GPU. Before requesting --gres=gpu:2, be sure that your job actually uses both GPUs in the node; otherwise use --gres=gpu:1.
GPU jobs can be run on the GPU nodes available in the gpu/gpufermi queues (refer to HPC Resource View).
Want to start from basic C++ using acc pragmas (OpenACC)? Visit this site. Also, find the GPU-capable software in the HPC Software Guide.
Installed Versions
All the available versions of CUDA can be viewed by issuing the following command. This applies to other applications as well.
module avail cuda
output:
------------------------- /usr/local/share/modulefiles/Core ------------------------
cuda/7.5 cuda/8.0 (D) cuda/9.0 cuda/9.2 cuda/10.0 cuda/10.1 cuda/11.2
The default version is identified by "(D)" behind the module name and can be loaded as:
module load cuda
The other versions of CUDA can be loaded as:
module load cuda/<version>
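For example, to load CUDA 10.1 from the list above:
module load cuda/10.1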
Running GPU jobs
Interactive job
For available GPU queues/partitions, visit HPC Resource View. To access nodes in the "gpufermi" queue, type:
srun --x11 -p gpu -C gpufermi --gres=gpu:1 -N 1 -n 2 --time=1:00:00 --mem=5gb --pty /bin/bash
Note: This requests a gpufermi node type (-C gpufermi) from the gpu queue (-p gpu) with one of the node's two GPUs (--gres=gpu:1), two tasks on one node (-N 1 -n 2), a one-hour time limit (--time=1:00:00), and 5 GB of memory (--mem=5gb).
OR access nodes in "gpuk40" queue that has latest version of drivers:
srun --x11 -p gpu -C gpuk40 --gres=gpu:1 --pty /bin/bash
OR access nodes in "gpup100" queue that has latest version of drivers:
srun --x11 -p gpu -C gpup100 --gres=gpu:1 --pty /bin/bash
Load the cuda module (load the latest version of the cuda module for the "gpu" queue):
module load cuda
Run the deviceQuery executable:
deviceQuery
You should see information about the CUDA devices, such as the number of cores and multiprocessors, which may be useful during CUDA programming.
This output is from a gpufermi node:
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 2 CUDA Capable device(s)
Device 0: "Tesla M2090"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 5375 MBytes (5636554752 bytes)
(16) Multiprocessors x ( 32) CUDA Cores/MP: 512 CUDA Cores
GPU Clock rate: 1301 MHz (1.30 GHz)
Memory Clock rate: 1848 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block
...
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 2, Device0 = Tesla M2090, Device1 = Tesla M2090
You can also run the "nvidia-smi" command:
nvidia-smi
output:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla M2090 On | 0000:0B:00.0 Off | 0 |
| N/A N/A P12 29W / 225W | 10MiB / 5375MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 Tesla M2090 On | 0000:0C:00.0 Off | 0 |
| N/A N/A P12 31W / 225W | 10MiB / 5375MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
Note: GPU-Util on both GPUs is 0%. When GPU jobs are running, you will see non-zero values.
If your application supports communication between the two GPUs in a node, use:
srun --x11 -p gpu -C gpuk40 --gres=gpu:2 --pty /bin/bash
If you forget to load the cuda module, the CUDA shared libraries cannot be found and you will encounter an error like the following:
./deviceQuery: error while loading shared libraries: libcudart.so.3: cannot open shared object file: No such file or directory
Compiling CUDA Code
Request a GPU node:
srun --x11 -p gpu -C gpuk40 --gres=gpu:1 --pty /bin/bash
Copy the hello.cu file from /usr/local/doc/CUDA:
cp /usr/local/doc/CUDA/hello.cu .
Load the cuda module
module load cuda
Compile:
nvcc hello.cu -o hello
Execute:
./hello
output:
Hello World!
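If you do not have access to the example file, a minimal hello.cu might look like the following. This is only a sketch; the actual file under /usr/local/doc/CUDA may differ. Device-side printf requires compute capability 2.0 or higher, which the cluster's GPUs meet.

#include <cstdio>

// Kernel executed on the GPU: each thread prints its index.
__global__ void hello_kernel() {
    printf("Hello World from GPU thread %d!\n", threadIdx.x);
}

int main() {
    // Launch one block of four threads on the device.
    hello_kernel<<<1, 4>>>();
    // Block until the kernel has finished so its output is flushed.
    cudaDeviceSynchronize();
    printf("Hello World!\n");
    return 0;
}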
Batch job
Use this Slurm script, job.sh:
#!/bin/bash
#SBATCH --time=10:00:00
#SBATCH -p gpu -C gpufermi --gres=gpu:1
#SBATCH -N 1 -n 6
#SBATCH -o cuda_test.o%j
module load cuda
deviceQuery
Submit the job:
sbatch job.sh
You should obtain the same results as above in the cuda_test.o<jobid> file in your working directory.
If your application supports communication between the two GPUs in a node (see the sketch after the line below), use:
#SBATCH -N 1 -n 1 -p gpu -C gpuk40 --gres=gpu:2
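A program only benefits from --gres=gpu:2 if it explicitly uses both devices, for example by selecting each one with cudaSetDevice. Here is a minimal sketch (not the cluster's example code) that runs a kernel on every visible GPU:

#include <cstdio>
#include <cuda_runtime.h>

// Simple kernel: scale each element of x by a.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1024;
    int count = 0;
    cudaGetDeviceCount(&count);   // with --gres=gpu:2 this should report 2
    for (int dev = 0; dev < count; ++dev) {
        cudaSetDevice(dev);       // direct subsequent CUDA calls to this GPU
        float *d = nullptr;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemset(d, 0, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
        cudaDeviceSynchronize();
        cudaFree(d);
        printf("Ran kernel on device %d\n", dev);
    }
    return 0;
}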
GPU Benchmark
To learn about the performance of the different GPUs, Tesla M20xx and Tesla K40, in the gpufermi and gpuk40 queues respectively, click on GPU Benchmark @ HPC.
GPU Compute Modes
There are also different GPU compute modes, as shown below:
Default : Multiple threads can run on this GPU
Exclusive Thread : Only one thread in one process can run on this GPU
Prohibited : No threads are allowed to run on this GPU
Exclusive Process : Many threads in one process will be able to run on this GPU
By default, the GPUs on these nodes are in Exclusive Process mode, meaning each GPU can be accessed by only a single process.
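You can query the compute mode of each device with the CUDA runtime API; a minimal sketch (not cluster-specific) that maps the computeMode property to the names above:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        const char *mode = "Unknown";
        switch (prop.computeMode) {
            case cudaComputeModeDefault:          mode = "Default";           break;
            case cudaComputeModeExclusive:        mode = "Exclusive Thread";  break;
            case cudaComputeModeProhibited:       mode = "Prohibited";        break;
            case cudaComputeModeExclusiveProcess: mode = "Exclusive Process"; break;
        }
        printf("Device %d (%s): compute mode = %s\n", i, prop.name, mode);
    }
    return 0;
}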
Warning:
If two jobs, each requesting one GPU, are assigned to the same node at nearly the same time (less than a 10-second delay between the two jobs), the second job may be terminated with the following error, without affecting the first job:
??? Error using ==> gpu_entry
src/cuda/context.cpp:361: CUDA driver error: invalid device (101)
References
Flags: An architecture flag (-arch, sometimes supplied via a CUDA_ARCH variable) may be required when compiling CUDA code. In addition, -gencode options may be helpful. The gpuk40 partition has NVIDIA K40 GPUs, which require the sm_35 architecture designation. Searches using keywords such as 'gencode' and 'arch' can provide more specific information, as well as instructions for compiling specific software for user installation on the HPC.
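For example, to compile the earlier hello.cu for the K40's compute capability 3.5 (a sketch; the exact flags that apply depend on the CUDA version loaded):
nvcc -gencode arch=compute_35,code=sm_35 hello.cu -o hello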