Parallelization of ARWPost
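Notes
Auto-parallelization test
In summary, building with auto-parallelization gives a speedup of roughly 1.4x over the single build in the timings below (5m52s down to about 4m10s).
In particular, leaving CC as the "gcc" chosen by configure appears to be slightly faster than compiling the C sources with icc.
☆ Single version ☆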
FC = ifort
FFLAG = -O3 -xSSE2
CC = gcc
Time: 5m52s
☆ Auto Parallel version (1) ☆
FC = ifort
FFLAG = -O3 -xSSE2 -parallel -par-report1
CC = gcc
Thread = 1
Time: 4m19s
Thread = 2
Time: 4m10s
Thread = 3
Time: 4m12s
Thread = 4
Time: 4m08s
Thread = 5
Time: 4m21s
Thread = 6
Time: 4m06s
☆ Auto Parallel version (2) ☆
FC = ifort
FFLAG = -O3 -xSSE2 -parallel -par-report1
CC = icc
CFLAG = -O3 -xSSE2 -parallel -par-report1
Thread = 1
Time: 4m28s
Thread = 2
Time: 4m10s
Thread = 3
Time: 4m25s
Thread = 4
Time: 4m11s
Thread = 5
Time: 4m23s
Thread = 6
Time: 4m09s
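As a quick check of the speedup figure, the recorded timings can be converted to seconds and divided. The following is a minimal Python sketch; the helper names `to_seconds` and `speedup` are hypothetical, and the numbers are copied from the Single and Auto Parallel (1) results above.

```python
import re

def to_seconds(text):
    """Convert a time string such as '5m52s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)

def speedup(baseline, measured):
    """Ratio of baseline wall-clock time to measured wall-clock time."""
    return to_seconds(baseline) / to_seconds(measured)

# Timings copied from the notes above (CC = gcc).
SINGLE = "5m52s"
AUTO_PARALLEL_GCC = {1: "4m19s", 2: "4m10s", 3: "4m12s",
                     4: "4m08s", 5: "4m21s", 6: "4m06s"}

for threads, elapsed in AUTO_PARALLEL_GCC.items():
    ratio = speedup(SINGLE, elapsed)
    print(f"Thread = {threads}: {elapsed} -> speedup {ratio:.2f}x")
```

With these numbers the ratio stays between about 1.35x and 1.43x, and the gain is already present at Thread = 1 and changes little as threads are added.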
OpenMP test
・Tested with 4-dimensional data of size 500x500x55x2.
☆ OpenMP version (Intel compiler) ☆
FC = ifort
FFLAG = -O3 -xSSE4.2 -openmp -openmp-report1 -parallel -par-report1
CC = icc
Thread = 1
Time: 4m00s
Thread = 8
Time: 3m43s
Thread = 16
Time: 3m12s
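To interpret the Intel OpenMP timings, one can estimate how much of the run time actually scales with threads. This is a rough Python sketch that assumes Amdahl's law describes the run (an assumption of this example, not something verified in these notes); `parallel_fraction` is a hypothetical helper.

```python
def parallel_fraction(t1, tn, n):
    """Estimate the parallel fraction p from Amdahl's law.

    Amdahl: speedup S = 1 / ((1 - p) + p / n), solved for p.
    t1 and tn are wall-clock times in seconds with 1 and n threads.
    """
    if n < 2:
        raise ValueError("need n >= 2 threads")
    s = t1 / tn
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / n)

# Timings from the Intel OpenMP results above, in seconds.
T1 = 4 * 60 + 0    # Thread = 1:  4m00s
T8 = 3 * 60 + 43   # Thread = 8:  3m43s
T16 = 3 * 60 + 12  # Thread = 16: 3m12s

print(f"8 threads:  speedup {T1 / T8:.2f}x, "
      f"p ~ {parallel_fraction(T1, T8, 8):.2f}")
print(f"16 threads: speedup {T1 / T16:.2f}x, "
      f"p ~ {parallel_fraction(T1, T16, 16):.2f}")
```

The two estimates disagree (about 0.08 and 0.21), so they are only a rough indication that most of the run time is not scaling with the thread count.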
☆ OpenMP version (PGI compiler) ☆
FC = pgf90
FFLAG = -O3 -fast -tp sandybridge-64 -mp -Minfo
CC = gcc
Thread = 1
Time: 5m00s
Thread = 2
Time: 5m55s
Thread = 4
Time: 6m13s
Thread = 8
Time: (not recorded)
Thread = 12
Time: (not recorded)
Thread = 16
Time: (not recorded)
Thread = 24
Time: (not recorded)
GPGPU test
☆ GPGPU version (PGI accelerator compiler) ☆
FC = pgf90
FFLAG = -O3 -fast -tp sandybridge-64 -mp -Minfo
CC = gcc
Thread = 1
Time: 4m00s
Thread = 2
Time: 4m10s
Thread = 4
Time: 4m12s
Thread = 8
Time: 4m08s
Thread = 12
Time: 4m21s
Thread = 16
Time: 4m21s
Thread = 24
Time: 4m06s
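The thread-count sweeps in these notes could be automated with a small timing harness. The sketch below is in Python and assumes the thread count is controlled through the OMP_NUM_THREADS environment variable (the notes do not say how Thread was set). The function names are hypothetical, and the command shown is a placeholder to be replaced with the real ARWpost executable.

```python
import os
import subprocess
import sys
import time

def time_run(command, threads):
    """Run `command` once with OMP_NUM_THREADS=threads; return wall-clock seconds."""
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    start = time.perf_counter()
    subprocess.run(command, env=env, check=True)
    return time.perf_counter() - start

def sweep(command, thread_counts):
    """Time `command` for each thread count and return {threads: seconds}."""
    return {n: time_run(command, n) for n in thread_counts}

def format_time(seconds):
    """Format seconds in the 'XmYYs' style used in these notes."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}m{secs:02d}s"

if __name__ == "__main__":
    # Placeholder command (assumption): substitute the actual executable here.
    command = [sys.executable, "-c", "pass"]
    for n, elapsed in sweep(command, [1, 2, 4]).items():
        print(f"Thread = {n}")
        print(f"Time: {format_time(elapsed)}")
```

Each thread count is timed only once here; repeating each run a few times would help separate real scaling from run-to-run variation, which looks comparable in size to the differences between thread counts in the listings above.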