Tutorial
https://devblogs.nvidia.com/parallelforall/openacc-example-part-1/
Work through parts 1-3 and check how the performance improves at each step, and why.
===
For example, from step 1: after running make, I got:
[chantana@XTK40c step1]$ ./laplace2d_acc
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 77.270656 s
Accelerator Kernel Timing data
/home/chantana/openacc-codes-example/posts/002-openacc-example/step1/../common/timer.h
  main NVIDIA devicenum=0
    time(us): 29,099,362
    85: compute region reached 1000 times
      85: data copyin transfers: 1000
        device time(us): total=8,866 max=39 min=7 avg=8
      88: kernel launched 1000 times
        grid: [128x1024] block: [32x4]
        device time(us): total=2,125,425 max=2,150 min=2,110 avg=2,125
        elapsed time(us): total=2,173,907 max=2,225 min=2,157 avg=2,173
      88: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=269,734 max=274 min=269 avg=269
        elapsed time(us): total=288,788 max=322 min=286 avg=288
      88: data copyout transfers: 1000
        device time(us): total=15,060 max=27 min=12 avg=15
    85: data region reached 2000 times
      85: data copyin transfers: 4000
        device time(us): total=6,404,090 max=1,745 min=1,595 avg=1,601
      97: data copyout transfers: 4000
        device time(us): total=6,459,089 max=1,789 min=1,603 avg=1,614
    97: compute region reached 1000 times
      100: kernel launched 1000 times
        grid: [128x1024] block: [32x4]
        device time(us): total=891,460 max=897 min=891 avg=891
        elapsed time(us): total=943,373 max=987 min=938 avg=943
    97: data region reached 2000 times
      97: data copyin transfers: 4000
        device time(us): total=6,468,102 max=1,778 min=1,614 avg=1,617
      106: data copyout transfers: 4000
        device time(us): total=6,457,536 max=1,777 min=1,603 avg=1,614
The first problem is visible in the timing data: the data regions around each kernels region (lines 85 and 97 in the output above) are entered 2000 times, so A and Anew make a round trip between host and device on every iteration. Those transfers account for roughly 26 s of device time in total, dwarfing the compute kernels themselves. A sketch of the step-1 structure that causes this follows.
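Here is a minimal, self-contained sketch of the step-1 pattern, assuming the standard Jacobi layout from the blog post (the mesh is reduced to 512x512 here and the boundary setup is simplified; the exact code is in the repository linked at the end):

/* Compile with, e.g.: pgcc -acc -ta=nvidia,time laplace2d.c */
#include <stdio.h>
#include <string.h>
#include <math.h>

enum { n = 512, m = 512 };

static float A[n][m], Anew[n][m];

int main(void)
{
    const int   iter_max = 1000;
    const float tol      = 1.0e-5f;
    float       error    = 1.0f;
    int         iter     = 0;

    memset(A, 0, sizeof(A));
    for (int j = 0; j < n; j++)        /* fixed boundary on the left edge */
        A[j][0] = Anew[j][0] = 1.0f;

    while (error > tol && iter < iter_max) {
        error = 0.0f;

        /* Each kernels region is an implicit data region: A and Anew are
         * copied to the device and back on EVERY iteration; those copies
         * are what dominate the 77 s total above. */
        #pragma acc kernels
        for (int j = 1; j < n-1; j++)
            for (int i = 1; i < m-1; i++) {
                Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }

        #pragma acc kernels            /* second implicit data region: another round trip */
        for (int j = 1; j < n-1; j++)
            for (int i = 1; i < m-1; i++)
                A[j][i] = Anew[j][i];

        if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }
    return 0;
}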
Then we move to step 2, where the key code change is wrapping the iteration loop in an OpenACC data region so the arrays stay resident on the GPU. (The compiler option -ta=nvidia,time is what produces the Accelerator Kernel Timing output shown for each step.) A sketch of the change is below.
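A minimal sketch of the step-2 change, assuming the step-1 structure sketched above: a structured data region hoists the copies out of the loop, so A is copied in once and out once, while Anew is created on the device and never transferred.

/* Step-2 change relative to the step-1 sketch: one data region
 * around the whole iteration loop. */
#pragma acc data copy(A), create(Anew)
while (error > tol && iter < iter_max) {
    error = 0.0f;

    #pragma acc kernels            /* no per-iteration copies of A/Anew now */
    for (int j = 1; j < n-1; j++)
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
        }

    #pragma acc kernels
    for (int j = 1; j < n-1; j++)
        for (int i = 1; i < m-1; i++)
            A[j][i] = Anew[j][i];

    if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
    iter++;
}

Note that the scalar error still makes a round trip every iteration for the reduction; that is the remaining 1000 copyin/copyout transfers in the step-2 timing output.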
Rebuild with make and run:
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 3.562648 s
Accelerator Kernel Timing data
/home/chantana/openacc-codes-example/posts/002-openacc-example/step2/../common/timer.h
  main NVIDIA devicenum=0
    time(us): 3,333,948
    85: data region reached 2 times
      85: data copyin transfers: 8
        device time(us): total=12,813 max=1,615 min=1,597 avg=1,601
      117: data copyout transfers: 10
        device time(us): total=12,959 max=1,624 min=26 avg=1,295
    91: compute region reached 1000 times
      91: data copyin transfers: 1000
        device time(us): total=6,546 max=33 min=4 avg=6
      94: kernel launched 1000 times
        grid: [128x1024] block: [32x4]
        device time(us): total=2,118,012 max=2,141 min=2,103 avg=2,118
        elapsed time(us): total=2,140,550 max=2,212 min=2,125 avg=2,140
      94: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=269,387 max=274 min=269 avg=269
        elapsed time(us): total=291,206 max=389 min=285 avg=291
      94: data copyout transfers: 1000
        device time(us): total=14,718 max=54 min=11 avg=14
    103: compute region reached 1000 times
      106: kernel launched 1000 times
        grid: [128x1024] block: [32x4]
        device time(us): total=899,513 max=904 min=898 avg=899
        elapsed time(us): total=923,150 max=962 min=917 avg=923
This is much faster: 3.56 s versus 77.27 s. What was reduced? The data regions are now entered only 2 times instead of 2000 (8 copyin transfers in total rather than 4000 per region), so the per-iteration round trips of A and Anew are gone; only the scalar error is still transferred each iteration. Total data-transfer time drops from roughly 26 s to well under 0.1 s, leaving the kernels themselves as the dominant cost.
Then step 3:
https://devblogs.nvidia.com/parallelforall/openacc-example-part-2/
The new thing here is tuning the kernel launch configuration with the gang and vector clauses:
#pragma acc kernels loop gang(32), vector(16)
for( int j = 1; j < n-1; j++) {
    #pragma acc loop gang(16), vector(32)
    for( int i = 1; i < m-1; i++ ) ...
- The gang(32) clause on the outer loop tells the compiler to launch 32 blocks in the Y (row) direction.
- The gang(16) clause on the inner loop tells it to launch 16 blocks in the X (column) direction.
- The vector(16) clause on the outer loop tells the compiler to use blocks that are 16 threads tall, processing the loop iterations in SIMD groups of 16.
- The vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads wide (one warp wide).
These clauses map directly onto the CUDA grid and block dimensions, as the rough equivalence below shows.
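For intuition, here is an illustrative CUDA-style equivalent of the launch configuration those clauses request (an analogy, not compiler-generated code); it matches the "grid: [16x32] block: [32x16]" line in the step-3 timing output further down:

// Illustrative mapping of the gang/vector clauses above:
// vector(32) inner loop -> block.x, vector(16) outer loop -> block.y
// gang(16)   inner loop -> grid.x,  gang(32)   outer loop -> grid.y
dim3 block(32, 16);   /* "block: [32x16]" in the timing output */
dim3 grid(16, 32);    /* "grid: [16x32]"  in the timing output */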
Here it is best to apply the gang and vector clauses only to the inner loop, which still ensures the blocks used are as wide as a warp (this is the configuration behind the grid: [16x1024] block: [32x4] kernel in the timing output below):
#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop
for( int j = 1; j < n-1; j++) {
    #pragma acc loop gang(16), vector(32)
    for( int i = 1; i < m-1; i++ )...
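Putting the pieces together, the step-3 main loop looks roughly like this, assembled from the two snippets above plus the step-2 data region (a sketch only; the exact code is in the repository linked at the end):

#pragma acc data copy(A), create(Anew)
while (error > tol && iter < iter_max) {
    error = 0.0f;

    /* Stencil kernel: gang/vector on both loops -> grid [16x32], block [32x16]. */
    #pragma acc kernels loop gang(32), vector(16)
    for (int j = 1; j < n-1; j++) {
        #pragma acc loop gang(16), vector(32)
        for (int i = 1; i < m-1; i++) {
            Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                + A[j-1][i] + A[j+1][i]);
            error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
        }
    }

    /* Copy kernel: clauses only on the inner loop -> grid [16x1024], block [32x4]. */
    #pragma omp parallel for shared(m, n, Anew, A)
    #pragma acc kernels loop
    for (int j = 1; j < n-1; j++) {
        #pragma acc loop gang(16), vector(32)
        for (int i = 1; i < m-1; i++)
            A[j][i] = Anew[j][i];
    }

    if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
    iter++;
}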
The output is:
Jacobi relaxation Calculation: 4096 x 4096 mesh
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 2.479513 s
Accelerator Kernel Timing data
/home/chantana/openacc-codes-example/posts/002-openacc-example/step3/../common/timer.h
  main NVIDIA devicenum=0
    time(us): 2,306,372
    86: data region reached 2 times
      86: data copyin transfers: 4
        device time(us): total=6,412 max=1,616 min=1,598 avg=1,603
      120: data copyout transfers: 5
        device time(us): total=6,437 max=1,608 min=17 avg=1,287
    92: compute region reached 1000 times
      92: data copyin transfers: 1000
        device time(us): total=5,110 max=40 min=4 avg=5
      96: kernel launched 1000 times
        grid: [16x32] block: [32x16]
        device time(us): total=1,448,425 max=1,456 min=1,442 avg=1,448
        elapsed time(us): total=1,467,417 max=1,518 min=1,459 avg=1,467
      96: reduction kernel launched 1000 times
        grid: [1] block: [256]
        device time(us): total=6,000 max=6 min=6 avg=6
        elapsed time(us): total=23,913 max=77 min=22 avg=23
      96: data copyout transfers: 1000
        device time(us): total=13,121 max=39 min=10 avg=13
    105: compute region reached 1000 times
      109: kernel launched 1000 times
        grid: [16x1024] block: [32x4]
        device time(us): total=820,867 max=828 min=815 avg=820
        elapsed time(us): total=841,001 max=860 min=833 avg=841
What are the differences? The gang/vector clauses change the main kernel's launch configuration from grid: [128x1024] block: [32x4] to grid: [16x32] block: [32x16], cutting its device time from about 2.12 s to 1.45 s, and the reduction kernel's time drops from roughly 269 ms to 6 ms. The total comes down from 3.56 s to 2.48 s.
The code examples can be downloaded from:
https://github.com/parallel-forall/code-samples/tree/master/posts