Parallelization of ARWpost

Notes

Auto-parallelization test

    1. In summary, switching to auto-parallelization gives a speedup of about 1.4x (5m52s down to between 4m06s and 4m21s).

    2. In particular, compiling with CC left as the "gcc" chosen by configure appears to be slightly faster than switching it to icc.

☆ Single-threaded version ☆

    • FC = ifort

    • FFLAG = -O3 -xSSE2

    • CC = gcc

      • Time: 5m52s

☆ Auto-parallel version (part 1) ☆

    • FC = ifort

    • FFLAG = -O3 -xSSE2 -parallel -par-report1

    • CC = gcc

      • Thread = 1

      • Time: 4m19s

      • Thread = 2

      • Time: 4m10s

      • Thread = 3

      • Time: 4m12s

      • Thread = 4

      • Time: 4m08s

      • Thread = 5

      • Time: 4m21s

      • Thread = 6

      • Time: 4m06s
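
The speedup over the single-threaded build can be checked directly from the timings above. The following is a minimal sketch in Python; the helper names `to_seconds` and `speedup` are illustrative and are not part of ARWpost.

```python
import re

def to_seconds(text):
    """Convert a wall-clock string such as '5m52s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)

def speedup(baseline, measured):
    """Ratio of baseline time to measured time (larger is faster)."""
    return to_seconds(baseline) / to_seconds(measured)

# Timings copied from the notes above.
SINGLE = "5m52s"
AUTO_PARALLEL_GCC = {
    1: "4m19s", 2: "4m10s", 3: "4m12s",
    4: "4m08s", 5: "4m21s", 6: "4m06s",
}

for threads, elapsed in AUTO_PARALLEL_GCC.items():
    print(f"threads={threads}: {elapsed} -> {speedup(SINGLE, elapsed):.2f}x")
```

For these runs the ratio ranges from about 1.35x (5 threads) to about 1.43x (6 threads).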

☆ Auto-parallel version (part 2) ☆

    • FC = ifort

    • FFLAG = -O3 -xSSE2 -parallel -par-report1

    • CC = icc

    • CFLAG = -O3 -xSSE2 -parallel -par-report1

      • Thread = 1

      • Time: 4m28s

      • Thread = 2

      • Time: 4m10s

      • Thread = 3

      • Time: 4m25s

      • Thread = 4

      • Time: 4m11s

      • Thread = 5

      • Time: 4m23s

      • Thread = 6

      • Time: 4m09s
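
To compare the two auto-parallel builds (CC = gcc versus CC = icc), the per-thread timings can be subtracted. This is only a sketch; the names `GCC`, `ICC`, and `differences` are made up for illustration, and each value is a single run, so small gaps may be within measurement noise.

```python
def to_seconds(text):
    """Convert 'XmYYs' to seconds, e.g. '4m19s' -> 259."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)

# Timings copied from the two auto-parallel lists above, keyed by thread count.
GCC = {1: "4m19s", 2: "4m10s", 3: "4m12s", 4: "4m08s", 5: "4m21s", 6: "4m06s"}
ICC = {1: "4m28s", 2: "4m10s", 3: "4m25s", 4: "4m11s", 5: "4m23s", 6: "4m09s"}

def differences(a, b):
    """Seconds by which b is slower than a, per thread count."""
    return {threads: to_seconds(b[threads]) - to_seconds(a[threads]) for threads in a}

diff = differences(GCC, ICC)
mean_diff = sum(diff.values()) / len(diff)

for threads, delta in diff.items():
    print(f"threads={threads}: icc is {delta:+d} s relative to gcc")
print(f"mean difference: {mean_diff:.1f} s")
```

In these measurements the icc build is never faster than the gcc build, and it is about 5 seconds slower on average.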

OpenMP test

• Tested with 4-dimensional data of size 500x500x55x2.

☆ OpenMP version (Intel compiler) ☆

    • FC = ifort

    • FFLAG = -O3 -xSSE4.2 -openmp -openmp-report1 -parallel -par-report1

    • CC = icc

      • Thread = 1

      • Time: 4m00s

      • Thread = 8

      • Time: 3m43s

      • Thread = 16

      • Time: 3m12s
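
One way to read these timings is through Amdahl's law, S = 1 / ((1 - p) + p / N), solved for the parallel fraction p. The notes themselves do not use this model; it is brought in here only as a rough diagnostic, and the function name `parallel_fraction` is hypothetical.

```python
def parallel_fraction(t1, tn, n):
    """Solve Amdahl's law S = 1 / ((1 - p) + p / n) for p.

    t1: run time with 1 thread (seconds)
    tn: run time with n threads (seconds)
    """
    s = t1 / tn
    return (1 - 1 / s) / (1 - 1 / n)

# Timings from the Intel OpenMP list above, converted to seconds.
T1 = 4 * 60                  # 4m00s with 1 thread
RUNS = {8: 3 * 60 + 43,      # 3m43s with 8 threads
        16: 3 * 60 + 12}     # 3m12s with 16 threads

for n, tn in RUNS.items():
    p = parallel_fraction(T1, tn, n)
    print(f"N={n}: speedup {T1 / tn:.2f}x, estimated parallel fraction {p:.1%}")
```

The two estimates disagree (roughly 8% from the 8-thread run and 21% from the 16-thread run), so the simple model does not fit these single measurements well. Either way, the figures suggest that only a small part of the run time is being parallelized.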

☆ OpenMP version (PGI compiler) ☆

    • FC = pgf90

    • FFLAG = -O3 -fast -tp sandybridge-64 -mp -Minfo

    • CC = gcc

      • Thread = 1

      • Time: 5m00s

      • Thread = 2

      • Time: 5m55s

      • Thread = 4

      • Time: 6m13s

      • Thread = 8

      • Time: (not recorded)

      • Thread = 12

      • Time: (not recorded)

      • Thread = 16

      • Time: (not recorded)

      • Thread = 24

      • Time: (not recorded)
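
The thread counts listed above could be swept with a small timing harness. The sketch below assumes the thread count is controlled through the `OMP_NUM_THREADS` environment variable, which the notes do not state explicitly. The functions `time_run` and `format_time` are hypothetical helpers, and the default command is a harmless placeholder, not the real post-processor.

```python
import os
import subprocess
import sys
import time

def time_run(command, threads):
    """Run command with OMP_NUM_THREADS=threads and return wall-clock seconds."""
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    start = time.perf_counter()
    subprocess.run(command, check=True, env=env)
    return time.perf_counter() - start

def format_time(seconds):
    """Format seconds in the 'XmYYs' style used in these notes."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m{secs:02d}s"

if __name__ == "__main__":
    # Placeholder command (assumption): pass the real executable on the
    # command line to time an actual run.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    for threads in (1, 2, 4, 8, 12, 16, 24):
        elapsed = time_run(command, threads)
        print(f"Thread = {threads}, Time: {format_time(elapsed)}")
```

Saved under a name of your choice, the script takes the program to time as its arguments, for example `./ARWpost.exe`. Repeating each run several times and keeping the median would make the comparison less sensitive to noise.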

GPGPU test

☆ GPGPU version (PGI accelerator compiler) ☆

    • FC = pgf90

    • FFLAG = -O3 -fast -tp sandybridge-64 -mp -Minfo

    • CC = gcc

      • Thread = 1

      • Time: 4m00s

      • Thread = 2

      • Time: 4m10s

      • Thread = 4

      • Time: 4m12s

      • Thread = 8

      • Time: 4m08s

      • Thread = 12

      • Time: 4m21s

      • Thread = 16

      • Time: 4m21s

      • Thread = 24

      • Time: 4m06s