A profiling study was undertaken on Jetstream (February 2023) after it was found to be significantly slower than Optima2D on the same case. The ILU(p) factorization time was similar for both codes, but Jetstream's RHS evaluation took roughly 5 times as long as Optima2D's.
A few things were learned from this study:
A significant amount of time (>50%) is spent in memory access compared to floating point operations
SIMD vectorization is very low (<<5%)
For load balanced grids, MPI performance is good with low overhead
The time breakdown of a turbulent RHS evaluation is:
Euler contributions = 23%
Viscous contributions = 58%
SA contributions = 18%
SATs = 1%
and future profiling efforts can be targeted accordingly.
A few attempts were made to rectify this in the Euler flux evaluation. However, I believe there is little low-hanging fruit, and if significant improvements are to be made, they will likely require significant code modification. A few things to focus on are as follows.
Many loops are of the form
do di = 1, 3
   it1 = mod(di,3) + 1
   it2 = mod(di+1,3) + 1
   do jit1 = 1, blk%jkmmax(it1)
      jkm(it1) = jit1
      do jit2 = 1, blk%jkmmax(it2)
         jkm(it2) = jit2
         do jdi = 1, blk%jkmmax(di)
            jkm(di) = jdi
            j = jkm(1); k = jkm(2); m = jkm(3)
            blk%q(j,k,m,1:nvar) = <<< SOME OPERATION >>>
         end do !-- jdi loop
      end do !-- jit2 loop
   end do !-- jit1 loop
end do !-- di loop
The use of jkm() makes the stride non-constant, which hinders vectorization, and can make the access order non-column-major, causing cache misses. A simpler indexing scheme should be considered to facilitate vectorization and cache utilization.
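One possible direction, sketched below: if the per-node operation has no recurrence along the sweep direction (an assumption that would need checking case by case), the loop order is free to choose, and the nest can be fixed to column-major order for every di so that the compiler sees constant strides and the indirect jkm() array disappears.

```fortran
!-- Sketch only: assumes <<< SOME OPERATION >>> has no recurrence along
!-- the sweep direction, so the loop order may be fixed to column-major.
!-- j is innermost with unit stride, so the j loop can vectorize.
do di = 1, 3
   do m = 1, blk%jkmmax(3)
      do k = 1, blk%jkmmax(2)
         do j = 1, blk%jkmmax(1)
            blk%q(j,k,m,1:nvar) = <<< SOME OPERATION >>>
         end do
      end do
   end do
end do
```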
A similar issue is seen with the use of the SBP operators, such as
do m = 1, blk%jkmmax(3)
   do k = 1, blk%jkmmax(2)
      do j = 1, blk%jkmmax(1)
         if (blk%surfTag(j,k,m) == 42) cycle !-- degenerate node
         do p = blk%Dx(di)%ibeg(j), blk%Dx(di)%iend(j)
            jp = j + blk%Dx(di)%ja(p)
            blk%s(j,k,m,1:5) = blk%s(j,k,m,1:5) &
                             + blk%Dx(di)%as(p) * blk%work(jp,k,m,1:5)
         end do
      end do
   end do
end do
where the indirect addressing in the p loop inhibits vectorization. Since second-order operators are used in the vast majority of cases, a more efficient implementation specific to second order should be considered. It could be wrapped in a preprocessor directive so that either the general SBP or the "fixed 2nd order" method is compiled, as required.
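A sketch of what the fixed second-order path might look like, assuming the usual second-order centred interior stencil with coefficient 1/2 (FIXED_2ND_ORDER is an illustrative flag name, and the one-sided boundary closures are indicated but not written out):

```fortran
#ifdef FIXED_2ND_ORDER
!-- Hard-coded second-order centred stencil: no indirect ja()/as()
!-- lookups, and the j loop has constant stride, so it can vectorize.
do m = 1, blk%jkmmax(3)
   do k = 1, blk%jkmmax(2)
      !-- one-sided second-order closures for j = 1 and j = jkmmax(1) here
      do j = 2, blk%jkmmax(1) - 1
         blk%s(j,k,m,1:5) = blk%s(j,k,m,1:5) &
            + 0.5d0 * (blk%work(j+1,k,m,1:5) - blk%work(j-1,k,m,1:5))
      end do
   end do
end do
#else
!-- general SBP path: p loop over blk%Dx(di)%ibeg(j)..iend(j) as above
#endif
```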
Note that there are many places in the code with the degenerate-node check seen above (if (blk%surfTag(j,k,m) == 42) cycle). The compute time for this check is minimal, but it does inhibit vectorization. Currently, since little to no vectorization occurs in the code, the check has no negative impact, but it should be addressed if vectorization is pursued. For example, it could be wrapped in a preprocessor directive that is enabled at compile time only when needed, since degenerate grids are rarely used.
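Such a guard might look like the following (DEGENERATE_GRIDS is a hypothetical flag name):

```fortran
!-- Sketch: compile the degenerate-node test only when the (hypothetical)
!-- DEGENERATE_GRIDS flag is defined, e.g. via -DDEGENERATE_GRIDS.
#ifdef DEGENERATE_GRIDS
         if (blk%surfTag(j,k,m) == 42) cycle !-- degenerate node
#endif
```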
The blk derived type (and others, such as bface()) is typified by blk(1:nblk)%q(1:jmax,1:kmax,1:mmax,1:nvar). However, this layout makes for inefficient memory access and low vectorization potential. When a calculation is performed at a given node, several values are usually required at that node, e.g. q(1:nvar), xi(1:3,1:3), etc. With the variable index stored last, the stride between values at a node (e.g. q(1) and q(2)) is non-unity, which hurts both cache access and vectorization. An ordering such as blk(1:nblk)%q(1:nvar,1:jmax,1:kmax,1:mmax) should be considered. Tests on a toy problem suggest that if vectorization of a typical loop with the current data structure were possible (it is prevented for the reasons discussed above), a speed-up of 6 times may be possible; changing to the data structure proposed here may yield a further 1.75 times speed-up (10.5 times total).
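The difference between the two layouts can be sketched as follows (the real(8) kind, the local allocation, and the nodeFlux name are illustrative assumptions, not the code's actual declarations):

```fortran
!-- Current layout: the nvar values at node (j,k,m) are separated by
!-- jmax*kmax*mmax elements in memory (Fortran column-major order).
! real(8), allocatable :: q(:,:,:,:)   !-- q(jmax,kmax,mmax,nvar)

!-- Proposed layout: all nvar values at a node are contiguous, so one
!-- node's state typically lands in a single cache line.
real(8), allocatable :: q(:,:,:,:)     !-- q(nvar,jmax,kmax,mmax)
allocate(q(nvar,jmax,kmax,mmax))

!-- Per-node work then touches one contiguous slice:
!    flux(1:nvar) = nodeFlux(q(1:nvar,j,k,m))  !-- nodeFlux is illustrative
```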
A useful seminar on SIMD vectorization from Niagara can be found here.