Tutorial
https://devblogs.nvidia.com/parallelforall/openacc-example-part-1/
Work through parts 1-3 and check how the performance improves at each step, and why.
===
For example, from step 1: after running make, I got:
[chantana@XTK40c step1]$ ./laplace2d_acc
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 77.270656 s
Accelerator Kernel Timing data
/home/chantana/openacc-codes-example/posts/002-openacc-example/step1/../common/timer.h
  main NVIDIA devicenum=0
    time(us): 29,099,362
    85: compute region reached 1000 times
      85: data copyin transfers: 1000
        device time(us): total=8,866 max=39 min=7 avg=8
      88: kernel launched 1000 times
        grid: [128x1024] block: [32x4]
        device time(us): total=2,125,425 max=2,150 min=2,110 avg=2,125
        elapsed time(us): total=2,173,907 max=2,225 min=2,157 avg=2,173
      88: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=269,734 max=274 min=269 avg=269
        elapsed time(us): total=288,788 max=322 min=286 avg=288
      88: data copyout transfers: 1000
        device time(us): total=15,060 max=27 min=12 avg=15
    85: data region reached 2000 times
      85: data copyin transfers: 4000
        device time(us): total=6,404,090 max=1,745 min=1,595 avg=1,601
      97: data copyout transfers: 4000
        device time(us): total=6,459,089 max=1,789 min=1,603 avg=1,614
    97: compute region reached 1000 times
      100: kernel launched 1000 times
        grid: [128x1024] block: [32x4]
        device time(us): total=891,460 max=897 min=891 avg=891
        elapsed time(us): total=943,373 max=987 min=938 avg=943
    97: data region reached 2000 times
      97: data copyin transfers: 4000
        device time(us): total=6,468,102 max=1,778 min=1,614 avg=1,617
      106: data copyout transfers: 4000
        device time(us): total=6,457,536 max=1,777 min=1,603 avg=1,614
The first problem is visible in the timing data: the data regions around each kernels region (lines 85 and 97 in the output above) are entered 2000 times, so A and Anew make a round trip between host and device on every iteration. Those transfers account for roughly 26 s of device time in total, dwarfing the compute kernels themselves. A sketch of the step-1 structure that causes this follows.
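Here is a minimal, self-contained sketch of the step-1 pattern, assuming the standard Jacobi layout from the blog post (the mesh is reduced to 512x512 here and the boundary setup is simplified; the exact code is in the repository linked at the end):

/* Compile with, e.g.: pgcc -acc -ta=nvidia,time laplace2d.c */
#include <stdio.h>
#include <string.h>
#include <math.h>

enum { n = 512, m = 512 };

static float A[n][m], Anew[n][m];

int main(void)
{
    const int   iter_max = 1000;
    const float tol      = 1.0e-5f;
    float       error    = 1.0f;
    int         iter     = 0;

    memset(A, 0, sizeof(A));
    for (int j = 0; j < n; j++)        /* fixed boundary on the left edge */
        A[j][0] = Anew[j][0] = 1.0f;

    while (error > tol && iter < iter_max) {
        error = 0.0f;

        /* Each kernels region is an implicit data region: A and Anew are
         * copied to the device and back on EVERY iteration; those copies
         * are what dominate the 77 s total above. */
        #pragma acc kernels
        for (int j = 1; j < n-1; j++)
            for (int i = 1; i < m-1; i++) {
                Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }

        #pragma acc kernels            /* second implicit data region: another round trip */
        for (int j = 1; j < n-1; j++)
            for (int i = 1; i < m-1; i++)
                A[j][i] = Anew[j][i];

        if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }
    return 0;
}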
Then we move to step 2, where the key code change is wrapping the iteration loop in an OpenACC data region so the arrays stay resident on the GPU. (The compiler option -ta=nvidia,time is what produces the Accelerator Kernel Timing output shown for each step.) A sketch of the change is below.
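A minimal sketch of the step-2 change, assuming the step-1 structure sketched above: a structured data region hoists the copies out of the loop, so A is copied in once and out once, while Anew is created on the device and never transferred.

/* Step-2 change relative to the step-1 sketch: one data region
 * around the whole iteration loop. */
#pragma acc data copy(A), create(Anew)
while (error > tol && iter < iter_max) {
    error = 0.0f;

    #pragma acc kernels            /* no per-iteration copies of A/Anew now */
    for (int j = 1; j < n-1; j++)
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
        }

    #pragma acc kernels
    for (int j = 1; j < n-1; j++)
        for (int i = 1; i < m-1; i++)
            A[j][i] = Anew[j][i];

    if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
    iter++;
}

Note that the scalar error still makes a round trip every iteration for the reduction; that is the remaining 1000 copyin/copyout transfers in the step-2 timing output.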
Rebuild with make and run:
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 3.562648 s
Accelerator Kernel Timing data
/home/chantana/openacc-codes-example/posts/002-openacc-example/step2/../common/timer.h
  main NVIDIA devicenum=0
    time(us): 3,333,948
    85: data region reached 2 times
      85: data copyin transfers: 8
        device time(us): total=12,813 max=1,615 min=1,597 avg=1,601
      117: data copyout transfers: 10
        device time(us): total=12,959 max=1,624 min=26 avg=1,295
    91: compute region reached 1000 times
      91: data copyin transfers: 1000
        device time(us): total=6,546 max=33 min=4 avg=6
      94: kernel launched 1000 times
        grid: [128x1024] block: [32x4]
        device time(us): total=2,118,012 max=2,141 min=2,103 avg=2,118
        elapsed time(us): total=2,140,550 max=2,212 min=2,125 avg=2,140
      94: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=269,387 max=274 min=269 avg=269
        elapsed time(us): total=291,206 max=389 min=285 avg=291
      94: data copyout transfers: 1000
        device time(us): total=14,718 max=54 min=11 avg=14
    103: compute region reached 1000 times
      106: kernel launched 1000 times
        grid: [128x1024] block: [32x4]
        device time(us): total=899,513 max=904 min=898 avg=899
        elapsed time(us): total=923,150 max=962 min=917 avg=923
This is much faster: 3.56 s versus 77.27 s. What was reduced? The data regions are now entered only 2 times instead of 2000 (8 copyin transfers in total rather than 4000 per region), so the per-iteration round trips of A and Anew are gone; only the scalar error is still transferred each iteration. Total data-transfer time drops from roughly 26 s to well under 0.1 s, leaving the kernels themselves as the dominant cost.
Then step 3:
https://devblogs.nvidia.com/parallelforall/openacc-example-part-2/
The new thing here is tuning the kernel launch configuration with the gang and vector clauses:
#pragma acc kernels loop gang(32), vector(16)
for( int j = 1; j < n-1; j++) {
    #pragma acc loop gang(16), vector(32)
    for( int i = 1; i < m-1; i++ ) ...
- The gang(32) clause on the outer loop tells the compiler to launch 32 blocks in the Y (row) direction.
- The gang(16) clause on the inner loop tells it to launch 16 blocks in the X (column) direction.
- The vector(16) clause on the outer loop tells the compiler to use blocks that are 16 threads tall, processing the loop iterations in SIMD groups of 16.
- The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads wide (one warp wide).
These clauses map directly onto the CUDA grid and block dimensions, as the rough equivalence below shows.
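For intuition, here is an illustrative CUDA-style equivalent of the launch configuration those clauses request (an analogy, not compiler-generated code); it matches the "grid: [16x32] block: [32x16]" line in the step-3 timing output further down:

// Illustrative mapping of the gang/vector clauses above:
// vector(32) inner loop -> block.x, vector(16) outer loop -> block.y
// gang(16)   inner loop -> grid.x,  gang(32)   outer loop -> grid.y
dim3 block(32, 16);   /* "block: [32x16]" in the timing output */
dim3 grid(16, 32);    /* "grid: [16x32]"  in the timing output */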
Here it is best to apply the gang and vector clauses only to the inner loop, which still ensures the blocks used are as wide as a warp (this is the configuration behind the grid: [16x1024] block: [32x4] kernel in the timing output below):
#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop
for( int j = 1; j < n-1; j++) {
    #pragma acc loop gang(16), vector(32)
    for( int i = 1; i < m-1; i++ )...
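Putting the pieces together, the step-3 main loop looks roughly like this, assembled from the two snippets above plus the step-2 data region (a sketch only; the exact code is in the repository linked at the end):

#pragma acc data copy(A), create(Anew)
while (error > tol && iter < iter_max) {
    error = 0.0f;

    /* Stencil kernel: gang/vector on both loops -> grid [16x32], block [32x16]. */
    #pragma acc kernels loop gang(32), vector(16)
    for (int j = 1; j < n-1; j++) {
        #pragma acc loop gang(16), vector(32)
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
        }
    }

    /* Copy kernel: clauses only on the inner loop -> grid [16x1024], block [32x4]. */
    #pragma omp parallel for shared(m, n, Anew, A)
    #pragma acc kernels loop
    for (int j = 1; j < n-1; j++) {
        #pragma acc loop gang(16), vector(32)
        for (int i = 1; i < m-1; i++)
            A[j][i] = Anew[j][i];
    }

    if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
    iter++;
}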
The output is:
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 2.479513 s
Accelerator Kernel Timing data
/home/chantana/openacc-codes-example/posts/002-openacc-example/step3/../common/timer.h
  main NVIDIA devicenum=0
    time(us): 2,306,372
    86: data region reached 2 times
      86: data copyin transfers: 4
        device time(us): total=6,412 max=1,616 min=1,598 avg=1,603
      120: data copyout transfers: 5
        device time(us): total=6,437 max=1,608 min=17 avg=1,287
    92: compute region reached 1000 times
      92: data copyin transfers: 1000
        device time(us): total=5,110 max=40 min=4 avg=5
      96: kernel launched 1000 times
        grid: [16x32] block: [32x16]
        device time(us): total=1,448,425 max=1,456 min=1,442 avg=1,448
        elapsed time(us): total=1,467,417 max=1,518 min=1,459 avg=1,467
      96: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=6,000 max=6 min=6 avg=6
        elapsed time(us): total=23,913 max=77 min=22 avg=23
      96: data copyout transfers: 1000
        device time(us): total=13,121 max=39 min=10 avg=13
    105: compute region reached 1000 times
      109: kernel launched 1000 times
        grid: [16x1024] block: [32x4]
        device time(us): total=820,867 max=828 min=815 avg=820
        elapsed time(us): total=841,001 max=860 min=833 avg=841
What are the differences? The gang/vector clauses change the main kernel's launch configuration from grid: [128x1024] block: [32x4] to grid: [16x32] block: [32x16], cutting its device time from about 2.12 s to 1.45 s, and the reduction kernel's time drops from roughly 269 ms to 6 ms. The total comes down from 3.56 s to 2.48 s.
The code examples can be downloaded from:
https://github.com/parallel-forall/code-samples/tree/master/posts