OpenACC

OpenACC (http://openacc.org/) is a parallel programming standard designed to let C and Fortran programs use GPUs with minimal code changes. The OpenACC API (Application Program Interface) is a collection of compiler directives that mark loops and regions of code to be offloaded from the host CPU to an attached accelerator (GPU device).
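
For a first look at what such a directive does, here is a minimal sketch (written for illustration only; it is not one of the cluster's example codes): a single pragma asks the compiler to generate GPU code for the loop and to handle the data movement for the statically sized arrays.

#include <stdio.h>

#define N 1000

int main(void) {
  float a[N], b[N], c[N];
  int i;

  for (i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

  /* The compiler generates GPU code for this loop and copies
     the arrays to and from the device automatically. */
  #pragma acc parallel loop
  for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];

  printf("c[10] = %f\n", c[10]);
  return 0;
}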

Running OpenACC Jobs

Interactive Job Submission

Request a GPU node:

srun -p gpu --gres=gpu:1 -N 1 -n 6 --pty /bin/bash

Load the PGI compiler (module load pgi loads the default version):

module load pgi

To see the other available versions of PGI, run module avail pgi, and load a specific version with:

module load pgi/<version>

You will also need to load the base module:

module load base/8.0

You can get the CUDA driver and device information by issuing the command:

pgaccelinfo 

Output:

CUDA Driver Version:           11020
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  460.32.03  Sun Dec 27 19:00:34 UTC 2020
...
  Multi-Device:                Yes
PGI Default Target:            -ta=tesla:cc70

Copy the C source below into a file named "calculate-pi.c" in your home directory. The only OpenACC addition is the #pragma acc directive that runs the loop on the GPU. To avoid transferring data back and forth between the host and the device during computation, which degrades performance, use a data clause (e.g. #pragma acc data copy(data1), create(data2)) before kernels and loops; a short sketch of a data region follows the execution example below, and the references give full details.

#include <stdio.h>

#define N 1000000

int main(void) {
  double pi = 0.0;
  long i;

  /* Offload the loop to the GPU; the reduction clause makes the
     sum over pi explicit (PGI also detects it automatically). */
  #pragma acc parallel loop reduction(+:pi)
  for (i = 0; i < N; i++) {
    double t = (double)((i + 0.5) / N);
    pi += 4.0 / (1.0 + t * t);
  }

  printf("pi=%16.15f\n", pi / N);
  return 0;
}

Compile:

pgcc -ta=tesla:cc70 -acc -Minfo=accel -o test calculate-pi.c

where:

-ta specifies the target accelerator; use the "PGI Default Target" reported by pgaccelinfo above (here, -ta=tesla:cc70).

-acc enables recognition of OpenACC pragmas and links in the OpenACC runtime libraries.

-Minfo=accel prints information about the accelerator code generated during compilation.

-o names the output executable (here, test).

The -Minfo=accel output reports whether the loop was parallelized and how it was mapped to the GPU (the sample below was generated with an older cc20/Fermi target, but the layout is the same):

main:
      6, Generating compute capability 2.0 binary
      8, Loop is parallelizable
         #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
         CC 2.0 : 20 registers; 2056 shared, 48 constant, 0 local memory bytes; 100% occupancy
     10, Sum reduction generated for pi

Executing:

./test

output: pi=3.141592653589877
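
As noted above, wrapping the compute regions in a structured data region keeps arrays resident on the GPU instead of copying them for every loop. The sketch below is illustrative only: the array names follow the data clause example above, while the sizes and loop bodies are made up for this sketch.

#include <stdio.h>

#define M 1000000

int main(void) {
  static double data1[M], data2[M];
  double sum = 0.0;
  int i, iter;

  for (i = 0; i < M; i++) data1[i] = i * 0.001;

  /* data1 is copied to the GPU once and back at the end of the region;
     data2 is created on the GPU only and never transferred. */
  #pragma acc data copy(data1), create(data2)
  {
    for (iter = 0; iter < 10; iter++) {
      /* Both kernels work on device-resident data; no per-loop transfers. */
      #pragma acc parallel loop
      for (i = 0; i < M; i++)
        data2[i] = 0.5 * data1[i];

      #pragma acc parallel loop
      for (i = 0; i < M; i++)
        data1[i] = data2[i] + 1.0;
    }
  }

  for (i = 0; i < M; i++) sum += data1[i];
  printf("sum = %f\n", sum);
  return 0;
}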

Time Profiling

You can determine the time spent in different parts of the code simply by replacing the cc70 suboption with time:

pgcc -ta=tesla:time -acc -Minfo=accel -o test calculate-pi.c

output:

main
  6: region entered 1 time
     time(us): total=157456 init=156370 region=1086
               kernels=256 data=0
     w/o init: total=1086 max=1086 min=1086 avg=1086
     8: kernel launched 1 times
        grid: [3907] block: [256]
        time(us): total=244 max=244 min=244 avg=244
     10: kernel launched 1 times
        grid: [1] block: [256]
        time(us): total=12 max=12 min=12 avg=12

Batch Job

Copy the Fortran file laplace2d.f90 from /usr/local/doc/OPENACC/ to your home directory:

cp /usr/local/doc/OPENACC/laplace2d.f90 ./

Request a GPU node as before and compile the Fortran code:

pgf90 -acc -ta=nvidia -Minfo=accel -Mpreprocess -o laplace laplace2d.f90 

Create a job script named laplace.slurm with the following contents to submit your job:

#!/bin/bash
#SBATCH --time=00:10:00
#SBATCH -p gpu -C gpufermi --gres=gpu:1
#SBATCH -N 1 -n 6
#SBATCH -o laplace2d.o%j

# Run from scratch space
cp laplace $PFSDIR
cd $PFSDIR

module load cuda
module load pgi

./laplace

# Copy results back to the submission directory
cp * $SLURM_SUBMIT_DIR

Submit your job:

sbatch laplace.slurm

Your output will be written to the file laplace2d.o<jobid>.

Using both OpenACC & OpenMP

Work in progress ...
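
In the meantime, here is a minimal illustrative sketch of one common hybrid pattern: OpenMP threads run a CPU loop while OpenACC offloads another loop to the GPU in the same program, compiled with both flags enabled (e.g. pgcc -mp -acc -ta=tesla:cc70 -Minfo=accel). The file contents and loops below are invented for this sketch and are not site-specific guidance.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
  static double a[N], b[N];
  int i;

  /* CPU portion: initialize the array in parallel with OpenMP threads */
  #pragma omp parallel for
  for (i = 0; i < N; i++)
    a[i] = 0.5 * i;

  /* GPU portion: offload this loop with OpenACC */
  #pragma acc parallel loop copyin(a) copyout(b)
  for (i = 0; i < N; i++)
    b[i] = a[i] * a[i];

  printf("b[N-1] = %f, CPU threads available: %d\n", b[N-1], omp_get_max_threads());
  return 0;
}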

References: