Title : Automatic Differentiation of Parallel OpenMP Programs
        
Area : Parallel Design Patterns
        
Tool : Tapenade, OpenMP
        
Summary : In recent years, many engineering and scientific applications have been written in Fortran or C, with OpenMP directives for parallelization. Scientists and engineers may use an Automatic Differentiation tool such as Tapenade to carry out sensitivity analysis and design optimization for critical processes and algorithms. Calling the differentiated functions and subroutines is an efficient way to obtain both the original results and the derivative values, in either Tangent or Adjoint mode. This paper discusses only the Loop Structure, a basic design pattern that calls for parallel operation. The Tangent, Tangent vector, Adjoint, and Adjoint vector modes have been examined for best performance.
 
Original Loop Structure for Tangent and Adjoint Transformation:
 
C$OMP parallel do private(i,tmp) shared(L,x,y,z)
              do i=1,L
                 tmp=sin(z(i))
                 x(i)=y(i)*tmp
              enddo
C$OMP end parallel do
 
After Tangent or Adjoint differentiation with Tapenade, the transformed program is expected to look like Figure 1 (Tangent and Adjoint) and Figure 2 (Tangent vector and Adjoint vector).
 
______________________________________________________________________________
!Tangent Transformation
C$OMP parallel do private(i,tmp,tmp_omp) shared(L,x,x_omp,y,y_omp,z,z_omp)
      do i=1,L
         tmp = sin(z(i))
         x(i)=y(i)*tmp
         tmp_omp=z_omp(i)*cos(z(i))
         x_omp(i)=y_omp(i)*tmp + y(i)*tmp_omp
      end do
C$OMP end parallel do

!Adjoint Transformation
C$OMP parallel do private(i,tmp,tmp_omp) shared(L,x,x_omp,y,y_omp,z,z_omp)
      do i=1,L
         tmp = sin(z(i))
         x(i)=y(i)*tmp
         y_omp(i) = y_omp(i) + tmp*x_omp(i)
         tmp_omp = y(i)*x_omp(i)
         z_omp(i) = z_omp(i) + cos(z(i))*tmp_omp
      end do
C$OMP end parallel do
_______________________________________________________________________________
Figure 1. Tangent and Adjoint Transformation
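
One way to gain confidence in the two transformations of Figure 1 is the dot-product test: for arbitrary seeds, the tangent output xd and the adjoint outputs yb, zb should satisfy xb.xd = yb.yd + zb.zd up to rounding. The program below is a minimal sketch of that check, not Tapenade output. It assumes free-form Fortran with !$omp sentinels in place of C$OMP, and the names ad_loop, loop_tangent, loop_adjoint, and the d / b suffixes (standing in for the _omp variables of Figure 1) are chosen here for illustration only.

```fortran
! Illustrative sketch: names and seeds are assumptions, not Tapenade output.
module ad_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Tangent mode: propagates the input perturbations yd, zd to xd.
  subroutine loop_tangent(n, x, xd, y, yd, z, zd)
    integer, intent(in) :: n
    real(dp), intent(in) :: y(n), yd(n), z(n), zd(n)
    real(dp), intent(out) :: x(n), xd(n)
    real(dp) :: tmp, tmpd
    integer :: i
    !$omp parallel do private(i, tmp, tmpd) shared(n, x, xd, y, yd, z, zd)
    do i = 1, n
       tmp   = sin(z(i))
       tmpd  = zd(i)*cos(z(i))
       x(i)  = y(i)*tmp
       xd(i) = yd(i)*tmp + y(i)*tmpd
    end do
    !$omp end parallel do
  end subroutine loop_tangent

  ! Adjoint mode: accumulates the output sensitivity xb into yb and zb.
  subroutine loop_adjoint(n, xb, y, yb, z, zb)
    integer, intent(in) :: n
    real(dp), intent(in) :: xb(n), y(n), z(n)
    real(dp), intent(inout) :: yb(n), zb(n)
    real(dp) :: tmp, tmpb
    integer :: i
    !$omp parallel do private(i, tmp, tmpb) shared(n, xb, y, yb, z, zb)
    do i = 1, n
       tmp   = sin(z(i))
       yb(i) = yb(i) + tmp*xb(i)
       tmpb  = y(i)*xb(i)
       zb(i) = zb(i) + cos(z(i))*tmpb
    end do
    !$omp end parallel do
  end subroutine loop_adjoint
end module ad_loop

program check_ad_loop
  use ad_loop
  implicit none
  integer, parameter :: n = 1000
  real(dp) :: x(n), y(n), z(n), xd(n), yd(n), zd(n), xb(n), yb(n), zb(n)
  real(dp) :: lhs, rhs
  integer :: i

  do i = 1, n
     y(i)  = 1.0_dp + 0.001_dp*i
     z(i)  = 0.002_dp*i
     yd(i) = cos(0.1_dp*i)          ! arbitrary tangent seeds
     zd(i) = sin(0.3_dp*i)
     xb(i) = 1.0_dp/(1.0_dp + i)    ! arbitrary adjoint seed
  end do
  yb = 0.0_dp
  zb = 0.0_dp

  call loop_tangent(n, x, xd, y, yd, z, zd)
  call loop_adjoint(n, xb, y, yb, z, zb)

  ! Both sides should agree up to floating-point rounding.
  lhs = dot_product(xb, xd)
  rhs = dot_product(yb, yd) + dot_product(zb, zd)
  print *, 'xb.xd         =', lhs
  print *, 'yb.yd + zb.zd =', rhs
  if (abs(lhs - rhs) > 1.0e-12_dp*max(1.0_dp, abs(lhs))) error stop 'dot-product test failed'
end program check_ad_loop
```

With a compiler such as gfortran, the -fopenmp flag enables the directives; without it they are treated as comments and the same check runs serially. Each iteration touches only element i of every array, which is why neither loop needs a reduction or an atomic update.
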
_______________________________________________________________________________

!Tangent Vector Transformation
      call omp_set_nested(.TRUE.)
C$OMP parallel do private(i,tmp,tmp_omp,nd) shared(L,x,x_omp,y,y_omp,z,z_omp,nbdirs)
      do i=1,L
         tmp = sin(z(i))
         x(i) = y(i)*tmp
C$OMP parallel do private(nd) shared(i,tmp,tmp_omp,x_omp,y,y_omp,z,z_omp,nbdirs)
         do nd=1,nbdirs
            tmp_omp(nd) = z_omp(nd, i)*cos(z(i))
            x_omp(nd, i) = y_omp(nd, i)*tmp + y(i)*tmp_omp(nd)
         end do
C$OMP end parallel do
      end do
C$OMP end parallel do

!Adjoint Vector Transformation
      call omp_set_nested(.TRUE.)
C$OMP parallel do private(i,tmp,tmp_omp,nd) shared(L,x,x_omp,y,y_omp,z,z_omp,nbdirs)
      do i=1,L
         tmp = sin(z(i))
         x(i) = y(i)*tmp
C$OMP parallel do private(nd) shared(i,tmp,tmp_omp,x_omp,y,y_omp,z,z_omp,nbdirs)
         do nd=1,nbdirs
            y_omp(nd, i) = y_omp(nd, i) + tmp*x_omp(nd, i)
            tmp_omp(nd) = y(i)*x_omp(nd, i)
            z_omp(nd, i) = z_omp(nd, i) + cos(z(i))*tmp_omp(nd)
         end do
C$OMP end parallel do
      end do
C$OMP end parallel do
______________________________________________________________________________
Figure 2. Tangent Vector and Adjoint Vector Transformation
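
The Tangent vector form of Figure 2 can be exercised with unit seeds: if direction 1 seeds y and direction 2 seeds z, the two rows of the output should equal the partial derivatives sin(z(i)) and y(i)*cos(z(i)). The following is a hedged sketch under several assumptions: free-form source with !$omp sentinels, illustrative names (ad_loop_vec, loop_tangent_vector, xd/yd/zd/tmpd for the _omp variables), and omp_set_max_active_levels(2) used in place of omp_set_nested, which OpenMP 5.0 deprecates. The inner region relies on default sharing, so i and tmp keep the values set by the enclosing thread.

```fortran
! Illustrative sketch: names are assumptions, not Tapenade output.
module ad_loop_vec
  !$ use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Tangent vector mode: nbdirs derivative directions per loop iteration.
  subroutine loop_tangent_vector(n, nbdirs, x, xd, y, yd, z, zd)
    integer, intent(in) :: n, nbdirs
    real(dp), intent(in) :: y(n), z(n), yd(nbdirs, n), zd(nbdirs, n)
    real(dp), intent(out) :: x(n), xd(nbdirs, n)
    real(dp) :: tmp, tmpd(nbdirs)
    integer :: i, nd

    ! Allow two active levels of parallelism (compiled only with OpenMP).
    !$ call omp_set_max_active_levels(2)

    !$omp parallel do private(i, nd, tmp, tmpd) shared(n, nbdirs, x, xd, y, yd, z, zd)
    do i = 1, n
       tmp  = sin(z(i))
       x(i) = y(i)*tmp
       ! Inner team shares the enclosing thread's i, tmp and tmpd.
       !$omp parallel do private(nd)
       do nd = 1, nbdirs
          tmpd(nd)  = zd(nd, i)*cos(z(i))
          xd(nd, i) = yd(nd, i)*tmp + y(i)*tmpd(nd)
       end do
       !$omp end parallel do
    end do
    !$omp end parallel do
  end subroutine loop_tangent_vector
end module ad_loop_vec

program check_tangent_vector
  use ad_loop_vec
  implicit none
  integer, parameter :: n = 8, nbdirs = 2
  real(dp) :: x(n), y(n), z(n)
  real(dp) :: xd(nbdirs, n), yd(nbdirs, n), zd(nbdirs, n)
  real(dp) :: err
  integer :: i

  do i = 1, n
     y(i) = 1.0_dp + 0.5_dp*i
     z(i) = 0.1_dp*i
  end do

  ! Unit seeds: direction 1 perturbs y only, direction 2 perturbs z only.
  yd = 0.0_dp
  zd = 0.0_dp
  yd(1, :) = 1.0_dp
  zd(2, :) = 1.0_dp

  call loop_tangent_vector(n, nbdirs, x, xd, y, yd, z, zd)

  ! Compare with the analytic partials dx/dy = sin(z) and dx/dz = y*cos(z).
  err = maxval(abs(xd(1, :) - sin(z))) + maxval(abs(xd(2, :) - y*cos(z)))
  print *, 'max deviation from analytic partials:', err
  if (err > 1.0e-12_dp) error stop 'tangent vector check failed'
end program check_tangent_vector
```

Whether the nested region pays off depends on nbdirs and the thread count; for a small number of directions, leaving the inner loop serial may well be faster, so the nesting here should be read as a structural illustration rather than a performance recommendation.
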
 
Preliminary results have been obtained for asynchronous worksharing. The Tangent and Tangent vector programs do not need additional worksharing, because the derivative statements have the same form as the original statements, with gradients carried alongside the values. For the Adjoint and Adjoint vector programs, parallel sections can be used to improve performance further. Figure 3 shows, as an example, the Adjoint transformation optimized with parallel sections.
___________________________________________________________________________________
C$OMP parallel sections private(i,tmp,tmp_omp)
C$OMP&   shared(L,x,x_omp,y,y_omp,z,z_omp)
C$OMP section
C$OMP parallel do private(i,tmp) shared(L,x,x_omp,y,y_omp,z)
      do i=1,L
         tmp = sin(z(i))
         x(i)=y(i)*tmp
         y_omp(i) = y_omp(i) + tmp*x_omp(i)
      end do
C$OMP end parallel do
C$OMP section
C$OMP parallel do private(i,tmp,tmp_omp) shared(L,x_omp,y,z,z_omp)
      do i=1,L
         tmp_omp = y(i)*x_omp(i)
         z_omp(i) = z_omp(i) + cos(z(i))*tmp_omp
      end do
C$OMP end parallel do
C$OMP end parallel sections
___________________________________________________________________________________
Figure 3. Adjoint Transformation with program optimization using parallel sections
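
The split in Figure 3 is valid because the two sections write disjoint arrays: the first writes x and y_omp, the second writes only z_omp, and both read y, z, and x_omp without modifying them. A minimal sketch of this arrangement is given below, with the result compared against the closed-form adjoint for zero-initialized accumulators. As assumptions of this sketch, it uses free-form source with !$omp sentinels, illustrative names (ad_loop_sections, loop_adjoint_sections, xb/yb/zb/tmpb for the _omp variables), and omp_set_max_active_levels(2) so that the inner parallel do regions can actually run with more than one thread.

```fortran
! Illustrative sketch: names are assumptions, not Tapenade output.
module ad_loop_sections
  !$ use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Adjoint of x(i) = y(i)*sin(z(i)), split into two independent sections.
  subroutine loop_adjoint_sections(n, x, xb, y, yb, z, zb)
    integer, intent(in) :: n
    real(dp), intent(in) :: xb(n), y(n), z(n)
    real(dp), intent(out) :: x(n)
    real(dp), intent(inout) :: yb(n), zb(n)
    real(dp) :: tmp, tmpb
    integer :: i

    ! Allow two active levels of parallelism (compiled only with OpenMP).
    !$ call omp_set_max_active_levels(2)

    !$omp parallel sections private(i, tmp, tmpb) shared(n, x, xb, y, yb, z, zb)
    !$omp section
    ! Section 1: original result and the y adjoint (writes x, yb).
    !$omp parallel do private(i, tmp)
    do i = 1, n
       tmp   = sin(z(i))
       x(i)  = y(i)*tmp
       yb(i) = yb(i) + tmp*xb(i)
    end do
    !$omp end parallel do
    !$omp section
    ! Section 2: the z adjoint (writes zb only).
    !$omp parallel do private(i, tmpb)
    do i = 1, n
       tmpb  = y(i)*xb(i)
       zb(i) = zb(i) + cos(z(i))*tmpb
    end do
    !$omp end parallel do
    !$omp end parallel sections
  end subroutine loop_adjoint_sections
end module ad_loop_sections

program check_adjoint_sections
  use ad_loop_sections
  implicit none
  integer, parameter :: n = 100
  real(dp) :: x(n), y(n), z(n), xb(n), yb(n), zb(n)
  real(dp) :: err
  integer :: i

  do i = 1, n
     y(i)  = 1.0_dp + 0.01_dp*i
     z(i)  = 0.02_dp*i
     xb(i) = cos(0.2_dp*i)     ! arbitrary adjoint seed
  end do
  yb = 0.0_dp
  zb = 0.0_dp

  call loop_adjoint_sections(n, x, xb, y, yb, z, zb)

  ! With zero initial accumulators: yb = sin(z)*xb and zb = y*cos(z)*xb.
  err = maxval(abs(yb - sin(z)*xb)) + maxval(abs(zb - y*cos(z)*xb))
  print *, 'max deviation from closed-form adjoint:', err
  if (err > 1.0e-12_dp) error stop 'sections adjoint check failed'
end program check_adjoint_sections
```
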
 
In addition, data vectorization and data prefetching at compile time are important for obtaining the best performance from parallel OpenMP programs.
 
In 2010 and beyond, OpenMP and Coarray Fortran will be tested and compared for second- and higher-order derivatives using Tapenade. Comments are welcome; please contact the author at karminghenry@sinaman.com.
 
My research area is in Parallel Optimization in Engineering.
 
Company : Cluster Technology Centre (Tentative Company)