Simply by adding "omp" and "acc" pragmas to your C++ code, you can not only parallelize it but also make it runnable on a GPU, as demonstrated in this "cpp_parallelization" tutorial. HPC would like to thank James Mertens of the PAT (Particle Astro Theory) group at Case Western Reserve University for his contribution. If you would like to make changes or add code beyond what he has done, please submit a pull request (see GitHub).
The cpp_parallelization repository [1] contains some basic C++ code set up to run in parallel on either multiple CPU cores or a GPU. The idea is not only to parallelize the code, but to do so using some modern C++11 features, wrapping the OpenACC directives up in a class.
The code creates a large "3D" array (a lattice) and uses an iterative scheme to solve an algebraic problem at each point in the lattice. The values calculated and stored at each point have no dependency on other lattice points, so the work can be parallelized in several ways. A comparison can then be made between execution speeds for the different parallelization schemes.
References: