NWChem: Benchmark of Integral Evaluation Approaches

        Short summary for a quick read: "I found that increasing stack memory while decreasing global memory (GPFS) can make single-point energy and optimization calculations in NWChem faster. Writing the integral scratch data to temporary memory can also help."

Integral Evaluation Approaches

I systematically studied how the integral evaluation approach affects the speed of a single-point energy calculation on a Ru(II)-Re(I) complex. In an NWChem DFT calculation, the inverted charge-density and exchange-correlation matrices are computed and written to disk storage. Adjusting the integral approach may therefore help accelerate geometry optimization with constrained density functional theory (CDFT) in NWChem.

The following NWChem keywords control, in different ways, how the integrals and their scratch files are handled (1).

INCORE: All the integrals are evaluated once and kept in memory; the matrix is then built from them.
DIRECT: Each integral is evaluated whenever it is needed and is never stored ("on-the-fly").
SEMIDIRECT: Some of the integrals are stored in memory or on disk; the others are evaluated when needed.
NOIO: Prevents storing files on disk and in memory (no input/output). This is pointless.

Computational setup

Task: Performing single-point energy calculation using Constrained-DFT on Ru(II)-Re(I) bridged complex.
Program: NWChem version 6.6 (http://www.nwchem-sw.org/), compiled with the MPI library and compiler listed below
MPI implementation: MVAPICH 2.2 (shared memory)
Compiler: Intel compiler 2013
Method: CDFT with B3LYP
Basis set: 6-31G(d) for light atoms // SDD-ECP for Ru and Re atoms
Basis functions: 1010
Cartesian basis functions: 1077
Ex. Input file: https://docs.google.com/document/d/1DfwFtrPi5AViZgJgmylw5nNxRwt0xhDBdGbzKxxPLx4/edit?usp=sharing
Ex. Output file: https://docs.google.com/document/d/1I5wHTUWNfx2Fo4FRpaZQ3m5YIp7ZNuwOCPLdyZTtB4o/edit?usp=sharing

Specification of computing Linux machine

Component            Specification                                                 No.
---------------------------------------------------------------------------------------
CPU                  Intel® Xeon® processor E5 v3 2.6 GHz (14 cores / 14 HTs)      x2
Memory               128 GB (total) DDR4-2133 MHz
Disk drive           PERC H730P
Network interface    Intel® Ethernet Controller X710                               x4

Benchmarking Evaluation of Integral Storage Options

The integral storage choices I used for this evaluation are:

1. On-the-fly: compute integrals whenever they are needed, with standard memory

direct (same as semidirect filesize 0 memsize 0)

standard memory: 50% global, 25% stack, and 25% heap

2. No disk: store no scratch file and use standard memory

semidirect filesize 0

3. No disk + mem size: store no scratch file and use 100 MW of memory

semidirect filesize 0 memsize 100000000

4. No disk + maximize mem: store no scratch file and use a high percentage of stack memory

semidirect filesize 0

specified memory: 20% global, 75% stack, and 5% heap

5. No disk + STD mem: store no scratch file and use standard memory

semidirect filesize 0

standard memory: 50% global, 25% stack, and 25% heap

6. Local disk + STD mem: use the cluster's local disk with 100 MW (the filesize value below) and standard memory

semidirect filesize 100000000

7. Local disk + maximize mem: use the local disk with 100 MW (the filesize value below) and a high percentage of stack memory

semidirect filesize 100000000

specified memory: 20% global, 75% stack, and 5% heap
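To get a feel for what the 100 MW values in these options mean, the word counts can be converted to bytes. The sketch below assumes that filesize and memsize are counted in 64-bit (8-byte) words, as the NWChem documentation describes for semidirect; the function name words_to_megabytes is my own and is not part of NWChem.

```python
BYTES_PER_WORD = 8  # assumption: one NWChem word is a 64-bit (8-byte) quantity


def words_to_megabytes(words, bytes_per_word=BYTES_PER_WORD):
    """Convert a word count (as passed to filesize/memsize) to decimal megabytes."""
    return words * bytes_per_word / 1_000_000


if __name__ == "__main__":
    size_in_words = 100_000_000  # the 100 MW value used in options 3, 6 and 7
    print(f"{size_in_words} words = {words_to_megabytes(size_in_words):.0f} MB")
```

Under this assumption, 100 MW corresponds to about 800 MB, which is worth comparing against the memory available to each process before choosing a memsize.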

What we found is summarized in the conclusions below. (Disclaimer: do not take this test on trust. It may be wrong, and you can always benchmark NWChem yourself.)

Conclusions

Single-point energy calculations (and geometry optimizations as well) using DFT in NWChem should be run in "semi-direct" mode with a maximized percentage of stack memory. That is, integrals are recalculated on demand rather than saved to disk. This is especially important when running on a large cluster with a shared file system. In our evaluations, increasing the stack memory up to 65% made the calculation at least 1-1.5 times faster than with the standard memory settings.

Suggestion

The following directives, which control memory usage, are suggested for efficient CDFT optimization. The memory used in all tests was 2 GB per process.

memory total 2000 stack 1300 heap 100 global 600 mb

dft
    semidirect filesize 0 memsize 100000000
    grid nodisk
end
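The memory line above can be derived from a percentage split. As a minimal sketch (the helper memory_directive is hypothetical and not an NWChem tool), the following Python snippet turns a total per-process memory in MB and stack/heap/global percentages into a memory directive. With 2000 MB and a 65/5/30 split it reproduces the line suggested here.

```python
def memory_directive(total_mb, stack_pct, heap_pct, global_pct):
    """Build an NWChem 'memory' line from a total (in MB) and a percentage split.

    All arguments are expected to be integers.
    """
    if stack_pct + heap_pct + global_pct != 100:
        raise ValueError("percentages must add up to 100")
    stack = total_mb * stack_pct // 100
    heap = total_mb * heap_pct // 100
    # Any rounding remainder goes to global so the three parts sum to the total.
    global_mb = total_mb - stack - heap
    return (
        f"memory total {total_mb} stack {stack} "
        f"heap {heap} global {global_mb} mb"
    )


if __name__ == "__main__":
    # 65% stack, 5% heap, 30% global of 2000 MB per process
    print(memory_directive(2000, 65, 5, 30))
    # Standard split used in the benchmark: 25% stack, 25% heap, 50% global
    print(memory_directive(2000, 25, 25, 50))
```

Keeping the split as percentages makes it easy to regenerate the directive when the memory available per process changes.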

References

1. NWChem manual: http://www.nwchem-sw.org/index.php/Release66:NWChem_Documentation

Rangsiman Ketkaew