Gaussian: Benchmark of G16 and Nvidia Tesla P100 GPU Acceleration
Gaussian 16 (G16) has implemented general-purpose graphics processing unit (GPGPU) support, which can be used to speed up quantum chemical calculations. GPGPU in G16 works only with Nvidia Tesla-series GPUs, including the K20, K40, and P100 (at the time of writing I am using G16 revision B.01). GPGPU acceleration is available for Hartree-Fock (HF) and density functional theory (DFT) methods, particularly for gradient and frequency (Hessian) calculations. According to the G16 developers, GPGPU is not effective for n-th order Møller–Plesset (MPn) or coupled cluster (CC) calculations, nor for small jobs. Thus G16-GPGPU should only be used for large calculations.
In this post, I present a benchmark of the GPU speedup of the state-of-the-art Nvidia Tesla P100 SXM2 16GB using G16 revision B.01 compiled with GPGPU support, versus the regular CPU-only build of the same G16 release. The test calculation is a gas-phase geometry optimization of vomilenine using DFT.
Compute Node Specification
CPU model: Intel Xeon Gold 6148 2.40GHz CPU
System Memory: 192 GB
Accelerator model: Nvidia Tesla P100-SXM2 16GB
Number of cards: 4
Number of accelerators per card: 1
Linux OS: Red Hat Enterprise Linux 7.3 x86_64
Intel Parallel Studio XE 2018
Interconnect technology: Intel Omni-Path MPI
Preparation of G16 input for exploiting GPGPU
G16 is very user-friendly, even for newbies, and it is easy to prepare an input file for running a calculation on both CPU and GPU. First, you need to know how many CPU cores and GPUs your machine has. Use the lscpu command to check the available CPU cores, and use nvidia-smi to check your machine's Nvidia GPUs and their utilization.
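As a quick sketch (assuming a Linux machine with GNU coreutils, and the Nvidia driver utilities if a GPU is present), the core and GPU counts can be checked like this:

```shell
# Count logical CPU cores (nproc ships with GNU coreutils)
NCORES=$(nproc)
echo "CPU cores: ${NCORES}"

# List the Nvidia GPUs, if the driver utilities are installed
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
else
    echo "nvidia-smi not found (no Nvidia driver on this machine)"
fi
```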
In my case, I have 40 CPU cores and 4 GPUs. 4 of the CPU cores will be used to control the 4 GPUs, so 36 cores remain for the actual computation. In other words, the job uses 36 compute cores, 4 GPU-controlling cores, and 4 GPUs. To set up a GPGPU input file based on this allocation, I replace the %NProcShared=N line with the following lines
%CPU=0-39
%GPUCPU=0-3=36-39
I call this CPU36+GPU4, which means that the calculation will
use 40 CPU cores: nos. 0-39.
use 4 GPUs: nos. 0, 1, 2, and 3.
use 4 of those CPU cores to control the 4 GPUs: nos. 36, 37, 38, and 39.
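The arithmetic above can be sketched in a few lines of shell, assuming a machine whose cores are numbered 0 to N-1 and whose last G cores are reserved as GPU controllers (the variable names here are my own):

```shell
NCORES=40   # total logical CPU cores (from lscpu)
NGPUS=4     # number of GPUs (from nvidia-smi)

LAST_CORE=$((NCORES - 1))
FIRST_CTRL=$((NCORES - NGPUS))   # first core reserved as a GPU controller

# Link 0 directives for the G16 input file
CPU_LINE="%CPU=0-${LAST_CORE}"
GPUCPU_LINE="%GPUCPU=0-$((NGPUS - 1))=${FIRST_CTRL}-${LAST_CORE}"

echo "${CPU_LINE}"      # prints %CPU=0-39
echo "${GPUCPU_LINE}"   # prints %GPUCPU=0-3=36-39
```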
For more details please consult http://gaussian.com/gpu/.
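Putting it together, the top of a GPGPU-enabled input file might look like the sketch below. The %Mem value, job title, and coordinate placeholder are my own illustrative assumptions; the method and basis set are the ones used in this benchmark:

```
%Mem=64GB
%CPU=0-39
%GPUCPU=0-3=36-39
#P B3LYP/6-31G(d) Opt

Vomilenine geometry optimization (illustrative header)

0 1
...atom coordinates...
```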
My G16-GPU calculation is submitted using the command
g16 < input > output 2>&1
2>&1 redirects stderr to stdout so that both are written to the output file.
To make sure that your G16 calculation is actually using the GPUs, run the nvidia-smi utility and check the beginning of the output file. Below is an example of the nvidia-smi display and a part of the output file.
GPU utilization
Calculation using CPU8+GPU4: 4 GPUs are used for GPGPU, and 4 of the 12 CPU cores are used to control those GPUs.
Computational details
Geometry optimization of vomilenine using B3LYP
Basis sets (number of basis functions): 6-31G(d) (419) and 6-311++G(3df,3pd) (1371)
Here is the input file: https://pastebin.com/B6GfC1Kc.
Structure of Vomilenine
Benchmark Results
Concluding remarks
I found that the GPU accelerator significantly speeds up the calculation compared with the GPU-free run.
For the calculation using the small basis set, increasing the number of GPUs does not significantly increase the parallel efficiency.
The parallel performance of G16-GPU is much more apparent for large calculations (the biggest basis set) than for small ones.
In other words, the real strength of G16-GPU only shows when running larger calculations.
Rangsiman Ketkaew