OpenACC
OpenACC (http://openacc.org/) is a parallel programming standard designed to let C and Fortran programs easily offload work to GPUs. The OpenACC API (Application Program Interface) describes a collection of compiler directives that specify loops and regions of code to be offloaded from a host CPU to an attached accelerator (GPU device).
Important Notes
Want to start from basic C++ using acc pragmas? Visit this site.
Running OpenACC Jobs
Interactive Job Submission
Request a GPU node:
srun -p gpu --gres=gpu:1 -N 1 -n 6 --pty /bin/bash
Load PGI compiler (pgi is the default):
module load pgi
To see the other versions of PGI available, run the command module avail pgi, then load the desired version using:
module load pgi/<version>
You will also need to load the base module
module load base/8.0
You can get the CUDA driver information by issuing the command:
pgaccelinfo
Output:
CUDA Driver Version: 11020
NVRM version: NVIDIA UNIX x86_64 Kernel Module 460.32.03 Sun Dec 27 19:00:34 UTC 2020
...
Multi-Device: Yes
PGI Default Target: -ta=tesla:cc70
Copy the C source below to a file named "calculate-pi.c" in your home directory. In this source, we simply add #pragma acc so that the loop runs on the GPU. To prevent data from being transferred back and forth between the host and the device during computation, which degrades performance, use the data clause (e.g. #pragma acc data copy(data1), create(data2)) before kernels and loops; a short sketch appears after the example below. Refer to the references for details.
#include <stdio.h>
#define N 1000000
int main(void) {
    double pi = 0.0;
    long i;
    /* offload the loop to the GPU; the compiler generates the sum reduction on pi */
    #pragma acc parallel loop
    for (i = 0; i < N; i++) {
        double t = (double)((i + 0.5) / N);
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi=%16.15f\n", pi / N);
    return 0;
}
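The loop is a midpoint-rule approximation of the integral that defines pi, which is why the accumulated sum is divided by N at the end:

pi = integral from 0 to 1 of 4/(1+t^2) dt  ≈  (1/N) * sum over i=0..N-1 of 4/(1+t_i^2),  where t_i = (i+0.5)/N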
Compile:
pgcc -ta=nvidia:cc20 -acc -Minfo=accel -o test calculate-pi.c
where,
-ta => target accelerator; set it to match the "PGI Default Target" reported by the pgaccelinfo command shown above
-acc enables recognition of OpenACC pragmas and includes the OpenACC runtime libraries
-Minfo=accel prints accelerator information during compilation
-o names the output executable, here test
The -Minfo=accel output provides important information about performance and whether the code is parallelizable:
main:
6, Generating compute capability 2.0 binary
8, Loop is parallelizable
#pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
CC 2.0 : 20 registers; 2056 shared, 48 constant, 0 local memory bytes; 100% occupancy
10, Sum reduction generated for pi
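The line "Sum reduction generated for pi" shows that the compiler detected the sum into pi and generated the reduction on its own. A minimal sketch of the equivalent explicit form, the same loop from calculate-pi.c with the reduction stated in the pragma rather than left to compiler detection:

/* explicit OpenACC reduction clause instead of compiler-detected reduction */
#pragma acc parallel loop reduction(+:pi)
for (i = 0; i < N; i++) {
    double t = (double)((i + 0.5) / N);
    pi += 4.0 / (1.0 + t * t);
}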
Executing:
./test
output: pi=3.141592653589877
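The pi example has no arrays to manage, but for codes that do, the data clause mentioned earlier keeps arrays resident on the GPU between kernels. A minimal sketch, using hypothetical arrays a and b that are not part of calculate-pi.c:

#include <stdio.h>
#define N 1000000
static double a[N], b[N];
int main(void) {
    long i;
    for (i = 0; i < N; i++) a[i] = (double)i;   /* initialize on the host */
    /* a is copied to the device and back; b exists only on the device */
    #pragma acc data copy(a), create(b)
    {
        #pragma acc parallel loop
        for (i = 0; i < N; i++)
            b[i] = 2.0 * a[i];
        #pragma acc parallel loop
        for (i = 0; i < N; i++)
            a[i] = a[i] + b[i];   /* b is reused without any host transfer */
    }
    printf("a[1]=%f\n", a[1]);
    return 0;
}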
Time Profiling
You can determine the time spent in different portions of the code just by replacing cc20 with time in the -ta option.
pgcc -ta=nvidia:time -acc -Minfo=accel -o test calculate-pi.c
output:
main
6: region entered 1 time
time(us): total=157456 init=156370 region=1086
kernels=256 data=0
w/o init: total=1086 max=1086 min=1086 avg=1086
8: kernel launched 1 times
grid: [3907] block: [256]
time(us): total=244 max=244 min=244 avg=244
10: kernel launched 1 times
grid: [1] block: [256]
time(us): total=12 max=12 min=12 avg=12
Batch Job
Copy the Fortran file laplace2d.f90 from /usr/local/doc/OPENACC to your home directory:
cp /usr/local/doc/OPENACC/laplace2d.f90 ./
Request a GPU node as before and compile the Fortran code:
pgf90 -acc -ta=nvidia -Minfo=accel -Mpreprocess -o laplace laplace2d.f90
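laplace2d.f90 is the workshop example that solves the 2D Laplace equation with a Jacobi iteration. As a rough illustration of that loop structure (an assumption about the file's contents, shown here in C rather than Fortran; it is not the actual file):

#include <stdio.h>
#include <math.h>
#define NX 512
#define NY 512
#define MAX_ITER 1000
#define TOL 1.0e-6
static double A[NY][NX], Anew[NY][NX];
int main(void) {
    double err = 1.0;
    int i, j, iter = 0;
    for (j = 0; j < NY; j++) A[j][0] = 1.0;   /* hypothetical boundary condition */
    /* keep both grids on the GPU for the whole iteration */
    #pragma acc data copy(A), create(Anew)
    {
        while (err > TOL && iter < MAX_ITER) {
            err = 0.0;
            /* let the compiler parallelize the nested loops and the max reduction */
            #pragma acc kernels
            {
                for (j = 1; j < NY - 1; j++)
                    for (i = 1; i < NX - 1; i++) {
                        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] + A[j+1][i] + A[j-1][i]);
                        err = fmax(err, fabs(Anew[j][i] - A[j][i]));
                    }
                for (j = 1; j < NY - 1; j++)
                    for (i = 1; i < NX - 1; i++)
                        A[j][i] = Anew[j][i];
            }
            iter++;
        }
    }
    printf("iterations=%d  error=%g\n", iter, err);
    return 0;
}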
Use the following laplace.slurm script to submit your job:
#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH -p gpu -C gpufermi --gres=gpu:1
#SBATCH -N 1 -n 6
#SBATCH -o laplace2d.o%j
cp laplace $PFSDIR            # copy the executable to scratch and run from there
cd $PFSDIR
module load cuda
module load pgi
./laplace
cp * $SLURM_SUBMIT_DIR        # copy the results back to the submission directory
Submit your job:
sbatch laplace.slurm
Your output should be written to an output file laplace2d.o<jobid>.
Using both OpenACC & OpenMP
Work in progress ...
References:
Practice directories with exercises and solutions: /usr/local/doc/OPENACC (obtained from the OpenACC workshop at Pittsburgh, Oct 16-17, 2014)
Getting Started
OpenACC Specification Guide:
NVIDIA Resources:
CUDA/OPENACC: http://developer.nvidia.com/cuda/openacc
TIPS for Optimization: http://www.nvidia.com/docs/IO/117377/directives-tips-for-c.pdf